Curate your data at scale

My name is Mahesh Misra. I'm a Principal Product Manager with AWS Lake Formation. With me I have a customer, Gerry Moses, who comes from the Amazon.com side and owns the data lake platform for Amazon.com. We also have Artie, a big data architect from the AWS Lake Formation team, here with us.

We're presenting to you today how to curate data at scale.

Let's review the agenda for today. The first topic is what data curation is: what we mean by it, how we look at it, how to think about it, and why it is so hard at enterprise scale. We'll also provide some best practices around how to think about it the right way, the AWS way.

Then Gerry will talk about their journey using AWS technology and processes to curate data at scale.

And Artie will show a cool demo of how AWS managed services can make it easy for you to manage data.

So before we dive in, let's look at some interesting numbers. This is survey data published last year by the Harvard Business Review, which says more than 90% of Fortune 1000 companies are investing in data modernization, and their data modernization investments are growing each year.

But when the same set of companies were surveyed on whether they really believe they are becoming more data driven, less than 25% of those companies believed that they are data driven.

Now, there is a key question here: where does this disparity come from? Why is it so hard for customers who are investing in data? Data volumes are growing, data infrastructure costs are growing, but they're not becoming data driven.

So before we go and talk about those challenges and how to solve them, let's understand this term, data curation. The way I define data curation, it's a five-stage process that is intended to break three types of silos: the data silo, the system silo, and the people silo.

The five stages start with identifying your source systems, then integrating and centralizing them in one place like a data lake or a data warehouse. Then you govern them, define guardrails and security controls around them, and share them with your users.

Now think about this. When you identify your sources and integrate them, you are essentially breaking down the data silos, because your source systems are discrete and you are bringing their data together in one place. When you centralize them in one place, you are creating standardized storage and a standardized way to access the data, so you are breaking down the system silo.

Then the last and most important thing: you are connecting your data with your people through governance and data-sharing capabilities, so you are breaking down your people and knowledge silo.

So when you can achieve all five of these things, you can say you are really curating data at scale.

So why are so many companies struggling to curate data at scale today? The first challenge we see is that their source systems, the points of data generation, the OLTP systems, are siloed. Customers today have their data in different solutions like ERPs and CRMs, and in different microservices, mobile and web applications. When they want to do analysis, they want that data together. That's very hard when your data is sitting in multiple different systems, because those systems support different storage layers, different interfaces, different security models, and come with a bunch of limitations.

Then you might be thinking, oh, there is something known as ETL, we have been doing that for years, we can do it, right? ETL is hard too. When you are doing data integration, first of all, when your sources are so discrete, you need to be able to connect to those data sources to get the data out. You need to understand their authentication and authorization models and create the right kind of connections and models to operate against the data in the source system.

Then you need to build custom ETL pipelines, which are like a bunch of scripts that you write, or if you have a UI-based tool, you create ETL pipelines using that tool. And guess what? After that, your problem is still not solved, because as your ETL jobs fail, you need to recover them from failure and you need to monitor them. And as your data volume grows, you are going to do a lot of undifferentiated heavy lifting to scale your infrastructure in a way that supports that growing volume.

So you did all that, and the journey moves on. How do you connect your data with your people? Governance is hard too. The way I define governance, it is infusing trust in data and defining guardrails and controls around data so that you can set your users free to do data-driven analysis. And when you start your data governance journey, essentially the first thing you want to do is know what exists: what data do I have and what controls do I need? That means you need to curate a lot of metadata.

Your source systems are growing, your data sets are growing, and now you have a third problem: your metadata is growing too. You end up with a massive amount of metadata if you want to collect all kinds of it. For example, you want to store the schema definitions of your data sets, you want to store data types, you want to store things like data profiles, you want to classify your data sets, you want to define an ontology, you want to build data products on top of it. These are hard problems, and you end up in a place where you will spend multiple months and multiple teams of engineering effort to curate data at that scale.

So that's your problem. When your metadata is broken, everything around it is also broken. You create wrong policies; you don't know which of your data is PII, so your policies are definitely wrong. Your guardrails depend on your metadata. So data governance itself is a difficult challenge for customers.

You are facing all of these problems every day. Now let's look at how to think about it the right way.

So when you start your data curation journey, the first thing I tell all my customers is: think long term and work backwards from that. When you think long term, the first choice you make is the right technology and the right technology partner, like AWS. AWS gives you managed capabilities for metadata harvesting, data classification, data quality checks, ETL, data cataloging, and a whole suite of products that are available to you.

For example, let's take Glue. Glue itself gives you ETL capabilities and data quality, which Artie is going to present later today, and you have sensitive data detection so that you can automatically detect your sensitive data and register it in a catalog. So the first thing is you've made a technology choice, right? You chose your partner, and then you have all the automation tools readily available from that partner to automate your data discovery and metadata management. You are not doing it manually anymore.

Earlier today we announced generative AI metadata capabilities in DataZone. You can use all these capabilities to automate your data discovery and metadata management.

The second important thing: you now have metadata harvesting, but you need to store the metadata somewhere so your consumers can see it, right? So catalog all your data assets, including your metadata, your data, and everything around it, create ontologies and business models on top of it, and catalog them in one single place. And because your metadata is clean and comes through automated processes, your entitlements will be clean. Use a centralized service like Lake Formation or DataZone to define those entitlements and share data, set your users free, and periodically audit them.

That way you understand the health of your architecture: how your governance is working, what data is being used and what is not, what policies exist and which are stale, and you keep improving your system over time. OK, you made the technology choice. But the second choice you want to make is choosing the right architecture. I always say take a long-term view, but you cannot get to the long-term view if you don't take the first step. The first step is to choose the right architecture, which is simple, and evolve from there.

If your data journey is starting now, you start with a very simple architecture: one single data catalog, one single account, all data registered there, and your consumers and employees consume from that same data catalog. Then evolve it and federate it: you have one central account where all your data sets live, and your consumers can be separated into different accounts. And if you are mature enough as a data team and you have fully decentralized practices, go for data mesh: you organize your data sets as data products and then have peer-to-peer data sharing.

You made a technology decision and you chose an architecture. Will that solve all the problems? Didn't we say in the beginning that 90% of companies are spending on technology and data lake modernization but are not becoming data driven? So what are we missing here? We are missing the people aspect of it.

So you need to invest in your people. The way to think about this is: don't align your teams based on what they do. "This is my data engineering team": wrong. "This is my business intelligence team": wrong. "This is my customer data ownership team": that's right. They take care of their consumers, they build it, they maintain it, they make decisions independently, and they collaborate among themselves to give the right data product and data experience to the consumers who are going to use the data at scale.

With that, I'm going to invite Gerry, who is from Amazon.com, a valued customer, to come on stage and talk about their data curation journey.

Thank you.

Hello everyone. Some say data is gold, but only if you can find it, hold on to it, and keep it flowing in. Data is very similar. Protecting data is more than just preventing bad actors from getting at it. It's also about doing it in a way that doesn't hamstring your business, and in a way that your customers trust that you're doing the right thing and continue doing business with you. What I want to talk to you about today is the journey of the Amazon Stores, Devices, and Other businesses in creating a program and process for protecting our data. And hopefully you can learn a little bit from our journey.

So first, a little bit about us. As I said, we are the Amazon Stores, Devices, and Other businesses; think of us as the Amazon.com data lake. We have a very large amount of data, and I put some numbers on the slide. But the thing about it is that the sheer volume of data, the number of data sets, and the number of teams are really forcing us to have automated, scalable processes for governance. Manual just will not cut it at this particular scale.

Our governance problem areas are very similar to what you saw earlier. We have this entire list, and each one of these things is probably a presentation unto itself. Today I'm going to talk to you about fine-grained access control and a little bit about consumption in multiple query engines.

So first, in order to explain the problem, I want to talk a little bit about how we were doing things. Compliance and protecting data are not optional; you have to find a way to do it. And in a world where there were no fine-grained access controls, there was only full data access, so we had to do something. What dataset owners were doing is, if we had to publish a slice of data, perhaps redacting certain rows because those rows represent a partner and are sensitive, they would create something like "my table, policy 1." If the table contains sensitive columns, maybe child data, maybe PII, we would create something like "policy 2." What we've now done is created three copies of the table,

"And as long also processes that actually need to run and keep each of these tables in sync. So now we have operational overhead, we have an additional storage cost.

Then on the other side, our consumers were also making copies of data to use AWS compute. Redshift was probably the biggest; that's 90% of our compute capacity. And for every Redshift cluster, we had to make another local copy. So we had to build a process, which we modeled as a subscription, that would synchronize these tables. For our popular data sets, that's 200 to 600 copies of a data set. So now this whole process is ballooning out. We've got all this redundant storage, and the biggest concern is: how do we know we have the right permissions and the right policies on all of these copies? And most importantly, how do we audit this? Auditing is now super complex; it's not one thing, it's now potentially 600 things. So we needed to think of a better way, and that better way is what we're calling fine-grained access controls.

I'm going to talk about the two key AWS technologies that we used, dig a little bit into what we did, and cover some of the things that we discovered along the way; hopefully, that's where the learnings are going to come from.

So the first key technology is AWS Lake Formation, which provides the fine-grained access controls. Along with that, there's Glue catalog federation. As you heard earlier, metadata gets big. We didn't want to do a big migration project to move all of our metadata into the Glue catalog and then change all of our tooling to work with Glue. So instead we used catalog federation, which avoids all of that; I'll explain how it works in a second. The second piece we used is Amazon Redshift data sharing, integrated with Lake Formation.

The first piece gives us all of the Glue-based query engines, so Athena, EMR, and Glue ETL, and then to bring the Redshift piece in, we needed the data share, and I'll talk a little bit about that. So how did we achieve the first piece? This is a highly simplified diagram of a really complex system, and what I'm really showing you here are just our integration points with Glue, which provides the schema to the query engines, and with Lake Formation, which implements the fine-grained access controls.

We didn't want to copy data, so we set up a federation service. What that does is, essentially, Glue delegates API calls to our service, and our service provides all the data that's required for that API. We get to keep our catalog, we get to keep our business processes, and everything works. Lake Formation provides the fine-grained access controls.

Lake Formation has a data cells filter, which is a construct that defines the rows and columns a user has access to, and then there are grants on the data cells filter. We have a permission service that sits at the end of our permissions workflow and provisions the data cells filter and the grant. All of this lives in a central account. You've been hearing about the architecture question: is this going to be multiple accounts or a single account?

Well, the core data catalog sits in a single account. We call it Andes; that's our brand name for it. But we have thousands of teams, and all of them have their own customer accounts, so we have to have a way of supporting thousands and tens of thousands of cross-account use cases. That's what resource linking is. What we essentially do is create a link between a customer's Glue catalog and our Glue catalog. At query time, the engine follows the resource link and gets the metadata, which is brokered by our federation engine; it then contacts Lake Formation, gets the authorized rows and columns, and executes the query. And so those are the integration points.
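
To give a flavor of what a resource link is mechanically, here is a minimal boto3 sketch of what a consumer account might run to create a Glue resource link pointing at a table in the central catalog. The account ID, database, and table names are hypothetical, and the real Andes subscription service automates this step along with the cross-account grants.

```python
import boto3

glue = boto3.client("glue")

# In the consumer account: create a resource link that points at a table
# owned by the central catalog account (IDs and names are placeholders).
glue.create_table(
    DatabaseName="local_links_db",          # database in the consumer's own catalog
    TableInput={
        "Name": "orders_link",              # name the link appears under locally
        "TargetTable": {
            "CatalogId": "111122223333",    # central (producer) account ID
            "DatabaseName": "sales_db",     # database in the central catalog
            "Name": "orders",               # shared table in the central catalog
        },
    },
)
```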

So those are the raw mechanics of how to make it work, but there's more to the story. This is where we get a little bit into our workflows. The first persona is the dataset owner, and with fine-grained access control we introduced the concept of tagging. How do you know what policies are associated with what rows and columns? A dataset owner has to associate a tag with a row or a column. For a row, we define what we call a row classifier, which is a SQL expression, or they can associate the tag with an attribute. That's one link. Then tags are also associated with what we call a data policy, and the data policy represents the enforcement as well as the permissions workflow that goes with it.

With all of this mapping together, we now know what permissions UX to present when someone requests permissions for a table, and our services know how to provision a data cells filter when a request is approved. What I describe here is just a technology mechanism. The act of tagging is a whole problem area unto itself, which many people are talking about today: how do you go about finding the right tags for the right attributes at scale? Are there AI models? Are there simple tools? We started with simple tools and manual processes, but ultimately this is an area alone that requires much more development.
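
As a rough illustration of what that provisioning step could look like, here is a minimal boto3 sketch that creates a Lake Formation data cells filter (a row classifier plus excluded columns) and grants SELECT on it to a consumer role. The account ID, database, table, filter, and role names are hypothetical placeholders, not Amazon's actual permission service.

```python
import boto3

lf = boto3.client("lakeformation")

# Define which rows and columns the consumer may see (all names are illustrative).
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "111122223333",    # account that owns the central catalog
        "DatabaseName": "sales_db",
        "TableName": "orders",
        "Name": "partner_redacted",          # the filter's name
        # Row classifier: a SQL-like expression selecting the allowed rows.
        "RowFilter": {"FilterExpression": "partner_id <> 'sensitive_partner'"},
        # Column policy: expose every column except the sensitive ones.
        "ColumnWildcard": {"ExcludedColumnNames": ["email", "birthdate"]},
    }
)

# Grant SELECT through the filter to the requesting principal once the request is approved.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::444455556666:role/AnalystRole"},
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": "111122223333",
            "DatabaseName": "sales_db",
            "TableName": "orders",
            "Name": "partner_redacted",
        }
    },
    Permissions=["SELECT"],
)
```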

Switching over to the consumer side, the first thing to talk about is the permissions workflow. We had one before, and it was: find the table, provide the identity, and request permission, full access, and when it comes through, you can go. Things are a little bit different with fine-grained access controls, in that we're not giving full access. We agonized over this UX for a very, very long time. We were thinking: are we going to do a column picker? Are we going to allow people to put their own row classifiers into it? In the end, we decided that's all too complex, at least for starting out, so we went with the idea of just naming our data access policies.

Remember the slide before, where we associated a table with a policy: the policy has a name. What you see here is the default policy, which is basically everything that's not associated with a sensitive data access policy. As we layer on more, and there will be a "region 1" policy later, you can see that this data set will have two policies associated with it. But the trick is that we name them, and that provides the simplifications we'll get to later.

The last part we need to set up is cross-account permissions management. We had the concept of a subscription before; this is how we set up our Redshift synchronization. We extended this to the Glue concepts, so rather than setting up a Redshift synchronizer, we set up the resource link: creating the link and granting all the cross-account permissions is done for you, so you don't mess it up, and this is a really easy step to mess up manually. That's all well and good, but this mechanism does a lot more. By having a subscription, we can do things for consumers on their behalf, such as doing an orchestrated update if a version changes, a schema changes, or permissions change. We can now control that, rather than chaos ensuing when something changes. With subscriptions we also know who to talk to if something is going to change.

Once we had all of these pieces together, in essence we had fine-grained access controls for Glue-based services: Athena, EMR, SageMaker, Glue ETL, that's all working. With all this great stuff, there are some technical challenges, and I'm going to go through them quickly.

The first one is somewhat non-intuitive, and it's one where we thought, no, we don't really need to worry about this problem, but we did. The problem is that there are three different components in the system that have different views of your metadata; the example I'm showing is data types. We have our internal catalog, we have Glue, and then we have files created by our tooling, and they all have their own slightly different nuance of what the individual data types are. We spent a lot of time building this list of all the permutations, and it turned out it wasn't that big. What this table helped us do is proactively identify where things were going to break, or where we were going to have an issue where a performance feature wasn't going to be used. This really helped us derisk the program early and turn it into a proactive "we know what we have to do" rather than a reactive "peel the onion."

This next area, performance, is a huge discussion on its own, so I'm going to talk about just one of the features. This is the single biggest thing users were terrified of when we talked about fine-grained access controls: what's going to happen to performance, since it's probably not going to be as good as it was when you had all that direct access.

The first feature we went after, which is a Glue feature, is called predicate pushdown, and we wanted to make sure our fine-grained access control features work with it. The way this feature works: imagine a table, my_table, with three partitions, and the partitions all have files associated with them. If you pass a predicate into the engine that's aligned with the partition key, the engine only has to read the partitions, and the files associated with those partitions, that match. In example one, we only have to read files one through n. In example two, we don't have that match, and now the engine has to read essentially all of the files.
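
To make that concrete, here is a small hypothetical sketch that submits two Athena queries over a table partitioned by region: one whose predicate aligns with the partition key, so only the matching partition's files are read, and one that filters on a non-partition column, so every file has to be scanned. The database, table, and results bucket are made-up names.

```python
import boto3

athena = boto3.client("athena")

def run(sql: str) -> str:
    """Submit a query and return its execution ID (illustrative only)."""
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "demo_db"},                         # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://demo-athena-results/"},   # hypothetical bucket
    )
    return resp["QueryExecutionId"]

# Partition-aligned predicate: the engine can prune down to the region='US' partition only.
run("SELECT count(*) FROM my_table WHERE region = 'US'")

# Non-partition predicate: no pruning is possible, so all partitions and files are read.
run("SELECT count(*) FROM my_table WHERE email LIKE '%@example.com'")
```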

The takeaway is that when you're planning your fine-grained access controls, you need to think about the performance features of the query engines and make them line up. Are your fine-grained access controls aligned with your partition keys? Are the data types you've chosen actually supported by the AWS services that implement this functionality? This is just one problem; there's an exhaustive list of things we did with Glue, and later with Redshift, to achieve the performance results we got. If you're doing a project like this, this is an area to focus on.

The second area is one that takes people by surprise; we call it permissions debugging. If you've been in the governance space, you've probably seen this problem before: a bunch of people in a room, one group is presenting a slide like the one on the left showing that a metric is flat, and someone else in the room stands up and says, no it's not, I reviewed this yesterday and this metric is actually increasing period over period. What happens? An inconclusive meeting. People take a break, go out, and do the traditional governance process: all right, do we have the same metric definition? Yes.

Are we pulling data from the same place? Yes. OK, so what's wrong? What you can see at the bottom is what's really happening under the covers: an additional predicate is getting added for the slide on the left, and unbeknownst to them, their data is actually different from the data on the right.

So how do you tease this out? Going back to our scale problem, we had to have a mechanism to make this self-service, so we created this process called permissions debugging.

You can find your data set and ask: what are my effective permissions right now, or what are the effective permissions for identity Z? What it does is show you the names of the policies. Here you can see that there are two, that I only have access to the default policy, and that there's another one sitting there.

The reason we went with this approach, rather than showing the raw SQL, is that it's easier for business users to understand, and there isn't the concern of the raw SQL potentially exposing sensitive data; here we're just showing the name of a policy. It's also actionable: I can look at this and see that I don't have access to region 1, and from here I can simply request permission. If I have a business case, I get permission and this is all resolved.

So this was a big area, and we feel that as we continue our journey it is going to become a big user-education issue. Having a way of doing this without you, or the team building this out, being in the middle answering these questions is critical.

The last piece is Amazon Redshift data sharing. Again, I'm just showing the change to our integration points. On the right, you can see we have a set of clusters we call producer clusters, which are owned by our own team, and we still have our Redshift synchronizer. But instead of making 200 or 400 or 600 copies of the data in Redshift, we copy it once, and all of this lives and is managed in our Andes account. Similar to what I showed you with Glue, we have thousands of consumer clusters, and they all exist in different AWS accounts.

So we again use our subscription mechanism to set up a Redshift data share, which is a feature of Redshift, and create a resource link. The query flow is very similar: the consumer cluster reaches out to the producer cluster, gets the metadata to execute the query, then contacts Lake Formation, figures out what the row- and column-level constraints are, and successfully executes the query, providing the consumer only with the data they have access to.
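
For reference, a bare-bones, hypothetical version of the data share plumbing looks roughly like the following, issued here through the Redshift Data API from Python. The cluster identifiers, databases, schema, and account numbers are placeholders, and the real Andes workflow wraps this in the subscription service described above.

```python
import boto3

rsd = boto3.client("redshift-data")

def run_sql(cluster_id: str, database: str, sql: str) -> str:
    """Submit one statement to a provisioned cluster via the Redshift Data API."""
    resp = rsd.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser="admin",                     # hypothetical database user
        Sql=sql,
    )
    return resp["Id"]

# Producer side: create the datashare, add objects, and grant it to the consumer account.
producer = "producer-cluster-1"             # hypothetical cluster identifiers
run_sql(producer, "analytics", "CREATE DATASHARE sales_share")
run_sql(producer, "analytics", "ALTER DATASHARE sales_share ADD SCHEMA sales")
run_sql(producer, "analytics", "ALTER DATASHARE sales_share ADD TABLE sales.orders")
run_sql(producer, "analytics",
        "GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '444455556666'")
# (The cross-account authorization/association steps done by the account admins,
#  and the extra steps for a Lake Formation-managed datashare, are omitted here.)

# Consumer side: surface the datashare as a local database and query it in place.
consumer = "consumer-cluster-1"
run_sql(consumer, "dev",
        "CREATE DATABASE sales_remote FROM DATASHARE sales_share "
        "OF ACCOUNT '111122223333' NAMESPACE 'producer-namespace-guid'")
run_sql(consumer, "dev", "SELECT count(*) FROM sales_remote.sales.orders")
```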

Again, with this solution, we've eliminated the redundant copies of Redshift data, going from n copies to one, and we're still using our single centralized permission store, so we don't have permissions scattered all over the place.

So let's talk about our results. 99% of our data sets are enabled for this infrastructure. What about the other 1%? Well, it's actually less than 1%. These are all the long-pole items related to that data type slide I showed you earlier. Instead of having an unknown, "we have 2,000 data sets and we're trying to figure out what's wrong with them," we actually have a plan: we know which data sets have a problem and we know what needs to be done. It's a matter of finishing up the data migrations, which we expect to have done by the end of the year. But it's our proactive planning that allowed us to get there.

The second aspect is the number of data sets that are exclusively on this infrastructure. The way we rolled this out is we said we would allow consumers to use either the legacy query path or the new query path. This allowed them to gain confidence in the infrastructure, and it allowed us to learn and tease out issues with functionality and performance without having to deal with issues in production.

The 12% here, which is actually now closer to 16%, represents the data sets that hold our most critical data. These are the policies we had to get done this year, and those are all done. Our goal is to get to 100%, or as close as we can, by the end of 2024. That's a pretty ambitious goal, but we think we can do it.

The rest of these results I talked about before. In the end, I think we ended up in a good spot. I'm really hoping that by sharing this information, you can learn a little bit from our journey and move a little bit faster in your own organizations.

So with that in mind, I'm going to invite Artie up here to talk a little bit about the future. Thank you.

Thank you, Gerry, for sharing your story and your journey with us. Good evening, everyone. I hope all of you are having a good time at re:Invent so far, with the sessions and the evening happy hours, right?

So that was the Amazon Andes data lake team. It's a very large-scale environment indeed: millions of data sets and tables, and exabytes of data. PowerPoint slides cannot convey the "big" in big data. You agree with me, right? You know exactly what I mean.

So they have built a very custom solution: a custom catalog on top of Glue and a custom permissions model on top of Lake Formation. This is to answer their very specific business needs.

Now, every organization, every business has unique needs. You have your own questions that you want to answer from your data sets. But does everyone have the time and resources to go build an elaborate custom solution? Do you need to do that in the first place? Do you need to reinvent the wheel every time?

That's where AWS managed services and their features come into play. You can quickly build a solution, a prototype, using the existing feature sets, which let you start playing with them immediately. You can test them on your actual business data sets, study them for scale, performance, and observability, quickly deploy them in production, and see how the solution grows and adapts.

So I'm going to show you a demo here using AWS managed features from Glue and Lake Formation. We have built this data curation pipeline from scratch, within minutes and hours. Let's see how it works. I probably have to log in; just give me a second.

OK. Let's start with the Redshift cluster, because businesses usually gather a lot of data, your customer data, right? You're gathering how your customers interact with your business, your services, and your products. Over a period of time, you are typically accumulating these data sets in your warehouse, and you want to process them later at some point to derive business insights.

So we'll start with this Redshift cluster. It's just like any other Redshift cluster, a small demo cluster for us, and I've taken a customer data set that holds very specific customer information. It's just like any other data set you gather, and it has 30 million rows. When you look at the columns, you have customer- or user-specific information like name, email address, birthday, country, whatever you are gathering in your data sets.

Let's also explore some of these rows. This is actually a synthetic data set: it has some transaction IDs, and the email addresses you see here are artificially generated, so there is no confidential information; it's for demo purposes.

So we're going to take this data set in Redshift and we are going to see how we are able to curate this further by building a seamless pipeline.

Mahesh talked about choosing the right tools and creating a data-driven culture, right? When you're making decisions from your data sets, you want those data sets to be valid, meaningful, and accurate.

Glue's data quality feature provides that functionality. It's commonly called DQ, it has prebuilt templates, and you can drag and drop them in Glue Studio; that's what we are going to do here.

Here you point directly at the Redshift cluster by creating a connection in Glue; we are not making copies of that Redshift data into S3. You point directly at it and run your transformation. This is a drag and drop from the existing templates, which Glue Studio allows you to do.

We are pointing directly at the Redshift cluster data, and for the transformation, we are going to run data quality tests on it. For our example, I have chosen to check whether a particular column is a primary key, whether a column's length is a certain number of characters, and whether there are no null values.

On the left side, you can see that you have a list of prebuilt choices. You can choose what type of data quality test you want to run on the data sets in the table and just drag and drop. You have a variety of options here, and you can also customize it, run your own custom SQL, check for data freshness, and so on.

And what do you want to do with the results of this data quality test? You can choose to take the results as they are and write them to S3. In Glue Studio I have taken a filter, again a drag-and-drop box, and I am choosing only the rows from the table which have passed the data quality checks, and then writing them into my data lake in S3.

So essentially we have done one curation step: pointing directly at the Redshift cluster data, running data quality tests on it using Glue ETL, and bringing only the valid, accurate data into my S3 data lake.

And you can also click on the script button and further customize it and write your own custom code on top of it and do whatever manipulations you want.
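
For orientation, the script behind a visual job like this one is roughly of the following shape. This is a simplified, hand-written sketch rather than the exact generated code, and the Glue connection name, ruleset columns, and S3 paths are all placeholders.

```python
from awsglue.context import GlueContext
from awsglue.transforms import SelectFromCollection
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Read directly from Redshift through a Glue connection (no copy of the source staged by us).
customers = glue_ctx.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "redshiftTmpDir": "s3://demo-temp/redshift/",     # hypothetical temp dir
        "useConnectionProperties": "true",
        "connectionName": "demo-redshift-connection",     # hypothetical Glue connection
        "dbtable": "public.customers",
    },
)

# DQDL ruleset mirroring the demo: primary key, column length, and completeness checks.
ruleset = """
Rules = [
    IsPrimaryKey "transaction_id",
    ColumnLength "country_code" = 2,
    IsComplete "email"
]
"""

dq_results = EvaluateDataQuality().process_rows(
    frame=customers,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "customers_dq"},
)

# Keep only rows that passed every rule, then write them to the data lake.
row_outcomes = SelectFromCollection.apply(dfc=dq_results, key="rowLevelOutcomes")
passed = row_outcomes.filter(lambda r: r["DataQualityEvaluationResult"] == "Passed")

glue_ctx.write_dynamic_frame.from_options(
    frame=passed,
    connection_type="s3",
    connection_options={"path": "s3://demo-data-lake/customers_clean/"},
    format="parquet",
)
```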

OK, let's save this, run it, and give it some time to finish. We have done two steps here: as I said, we have curated the data by pointing directly at the Redshift cluster, 30 million rows, letting the job run, and then we have crawled the resulting S3 data set using a Glue crawler.

That created the table in the Glue catalog, so let's review the column information in Glue. Glue Data Quality (DQ) adds three extra columns to your table as a result of the data quality checks, and we have filtered the rows based on this column, DataQualityEvaluationResult.

Only the rows that passed those data quality tests have been brought into our data lake. We had 30 million rows; let's explore that table in Athena. We see that we have got only 28 million rows. So for any downstream applications consuming this data set, you have ensured that it's accurate data, with no nulls, a valid data set.

So all your downstream applications are going to process it and make the right decisions for you. You can see that we have taken only the rows that passed all the data quality checks we added. So this is one step of curation, and we were able to do it directly, without making copies.

OK. In the next step, you're thinking about regulatory standards and compliance for data protection. You want to ensure that your data is accessed only on a need-to-know basis, right? In our customer example too, we saw that you want to share the data with just one copy: you want to be able to detect all the sensitive information, classify it as sensitive or not sensitive, and share that one copy of the data with the rest of your data users.

Glue's sensitive data detection feature provides that functionality. Let's again go into Glue Studio, where I have dragged and dropped these boxes. Now we are pointing at the catalog table that we crawled, which is the result of the DQ step. Sensitive data detection is again a prebuilt feature, and you can choose from a list of choices.

You can choose what sensitive data you want to detect, and you can choose the sampling size and the threshold at which you want to say, hey, I want to qualify this column as sensitive. In our data set we saw some personal names, first name and last name, and then we had email addresses. For our demo, I added US phone numbers and driver's license numbers; you can also detect credit card numbers, SSNs, and so on. These are all prebuilt detectors that will qualify data as sensitive in your data sets.

OK. The last point: what do you want to do with this data that you are detecting and classifying as sensitive or not sensitive? Here I have written a custom transform to decide what to do with the output of the detection. Let's take a pause.

You have a variety of roles, especially data roles, coming up in organizations. There are data scientists who just want to play with and explore the data sets and build their machine learning models; they don't need access to all of the PII data in your data sets, right? Then you have data analyst roles, who are exploring the data, building dashboards, and creating reports; they may need access to some PII data, but not all of it.

Then we have data engineering roles; they are writing scripts for applications to consume that data, so they probably need access to the full table, because their applications may be processing the data fully and storing it in an optimized way, or doing something else. So you have a variety of data users wanting to use the same table of data.

You want to have one source of truth but be able to share it with different users under different data restrictions, again with no copies. Lake Formation allows you to do that. Here, as an output of the Glue sensitive data detection, what we are showing is adding Lake Formation tags to the scanned columns of the table, tags that classify whether each column contains sensitive or not-sensitive data.

Again, you can click on the script version of the same drag-and-drop box, customize it, add further code, and take whatever output action you want. That is exactly what I am showing you here: I'm adding the Lake Formation tags to the table's columns, classifying each column as sensitive or not sensitive.
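
The tagging part of that custom transform boils down to a couple of Lake Formation API calls. Here is a minimal boto3 sketch, with hypothetical tag, database, table, and column names, that creates a classification tag and attaches it to the columns the detection step flagged.

```python
import boto3

lf = boto3.client("lakeformation")

# One-time setup: a "classification" LF-tag with its allowed values.
# (create_lf_tag fails if the tag already exists, so a real job would handle that.)
try:
    lf.create_lf_tag(TagKey="classification", TagValues=["sensitive", "not-sensitive"])
except lf.exceptions.AlreadyExistsException:
    pass

# Columns the sensitive data detection flagged in a demo-style table (hypothetical names).
sensitive_columns = ["first_name", "last_name", "email"]

# Tag the flagged columns as sensitive...
lf.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "curated_db",
            "Name": "customers_clean",
            "ColumnNames": sensitive_columns,
        }
    },
    LFTags=[{"TagKey": "classification", "TagValues": ["sensitive"]}],
)

# ...and tag the table itself as not-sensitive so the remaining columns inherit that default.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "curated_db", "Name": "customers_clean"}},
    LFTags=[{"TagKey": "classification", "TagValues": ["not-sensitive"]}],
)
```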

So at the column level, we are able to do the tagging in Lake Formation. I'll save this, run it, and wait for it to finish. Then we'll inspect the output, see how the resulting tags take effect on the tables in Lake Formation, and then query it further.

So here you see the table in Lake Formation. In Lake Formation, you are able to see what tags are on the database, the table, and the individual columns. We have taken the table, and you will see the tags that were added: I previously created the classification tag with the values sensitive and not sensitive, and you can see there are a few columns here that carry the sensitive and not-sensitive classifications.

We will query them in Athena, and I'll show you the difference. For our demo I have created two roles, a data analyst role and a data engineer role; the data engineer will be using an ETL role. These are two IAM roles referring to two different users in Lake Formation.

I have already created permission policies granting access to the resources that carry these tags. The data analyst can access only the not-sensitive columns, so the data analyst can use all the resources that do not have any sensitive information. The data engineer ETL role, on the other hand, has access to the full table, meaning access to all the resources with the Lake Formation tag classification equal to sensitive or not sensitive; they can access the entire table.

I have logged in as the data analyst now, and you can see that when the analyst queries the table, or even just views it in the Glue database or the Lake Formation catalog, they are not going to see the three columns that were detected as sensitive.

Those are the first name, last name, and email address; in this particular data set, we did not have any other confidential information like a phone number. OK, so let's also query in Athena and see what the data analyst sees. They see all the rows that passed the data quality checks, all the other columns are visible, and this is the data analyst's particular view of the table.

So they don't have visibility into those three columns. What happens if, since we know the names of those columns, we query them directly? The column simply doesn't exist for that user. The data analyst's view of the table is completely restricted; they don't even see the columns that aren't accessible to them.

Let's log in as the data engineer's ETL role and explore the same table. Understand that there's only one data set, and we have created permissions and shared it using Lake Formation to different roles in a restricted way. The engineer's ETL role will see the full table, with access to all columns.
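
For reference, the tag-based grants behind those two demo roles can be expressed with a couple of Lake Formation calls. This is a hypothetical sketch with placeholder role ARNs, reusing the illustrative classification tag from the tagging step above.

```python
import boto3

lf = boto3.client("lakeformation")

def grant_by_tag(principal_arn: str, tag_values: list[str]) -> None:
    """Grant SELECT on every table whose 'classification' LF-tag matches tag_values."""
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal_arn},
        Resource={
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": "classification", "TagValues": tag_values}],
            }
        },
        Permissions=["SELECT"],
    )

# The analyst only sees resources tagged not-sensitive...
grant_by_tag("arn:aws:iam::111122223333:role/data-analyst", ["not-sensitive"])

# ...while the data engineer's ETL role sees resources tagged either way, i.e. the full table.
grant_by_tag("arn:aws:iam::111122223333:role/data-engineer-etl",
             ["sensitive", "not-sensitive"])
```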

So Lake Formation lets you share the same data set with different users at the column level, row level, and cell level. You have very fine-grained access control using Lake Formation, and it's a regular full table for the engineer; that's the key here. They see the email address, the first name, the last name, and so on. I'm not able to switch the view; I need help here.

I need help to switch to the PowerPoint view. OK, sorry. So this is the architecture diagram showing the demo components. Using Glue's features, you can curate not only Redshift cluster data but also Redshift Serverless data, and you can point at data sources residing on on-premises servers, DynamoDB, DocumentDB, OpenSearch, and so on, as well as third-party data sources.

Now, the data curation picture is incomplete without a proper mechanism for data discovery. You want to be able to connect your people to the right data sets and also provide them the tools to analyze that data right away, all with ease, right? And you do not want to go through this process again and again whenever a new data set comes up or whenever a new team is created.

You want to think through this process once and make sure that you are connecting your people and data and providing them the tools right away. Amazon DataZone provides that functionality. It's a fully managed data management service that enables automated data discovery using advanced machine learning features, and it enriches your data sets, turning the technical metadata catalog into a business catalog with the generative AI feature that was launched yesterday.

Your data sets get much richer descriptions, so that your data users are able to easily find them using search terms they use in their day-to-day life. So you're connecting your data sources to your data users in a publisher-subscriber model.

The data user goes and searches for the data using simple terms and requests access to it. The publisher sees who is requesting access, is able to understand the need for that role to have access, and can then grant it. Once the user gets access to the data, they are immediately able to explore it using the tools allocated with that data set.

That is the DataZone project experience, where you connect your people, data, and tools together in one place with ease. It speeds up collaboration between team members to explore your data sets more quickly and effectively, and it speeds up your business decision process, all within the bounds of strong data governance with Lake Formation permissions underneath.

I encourage the audience here to go and explore Amazon DataZone in other re:Invent sessions. With that, we are at the end of our session. We discussed the challenges of curating data at scale in a large organization and some possible ways to think about solutions. We saw how the Amazon.com data lake team has curated their data lake using the latest features of AWS managed services like Glue, Lake Formation, and DataZone.

You can curate and build your data pipelines much more seamlessly and effectively, in hours or even minutes. At the end of the day, you are able to share the right data with the right people and help your business processes. So happy data curating, and thank you so much for your time and attention this evening.

These are some informational slides here if you want to take a picture of them, and please help us by filling out the session survey; it helps us deliver better and better sessions every year. Thank you again for coming. We'll open the floor now.

余额充值