What’s new in AWS Lake Formation

Good morning. So as a quick show of hands, who in the audience is using Lake Formation or Glue Data Catalog today? Ok. That's good. That's nearly 100%. That's great to see.

So for all of you who are using Lake Formation and the Glue Data Catalog today, I'm very excited that we have an amazing list of announcements. And for everyone that isn't using Lake Formation or the Glue Data Catalog yet, I'm pretty sure that at least some of the things in this presentation will hopefully help you change your mind.

Out of curiosity, did everyone have a great night yesterday at re:Invent? Good, good. You know, just making sure everyone's awake, that's really important.

So my name is Leon Dichter. I'm part of the product management team for Lake Formation and the Glue Data Catalog. And I'm very excited that today I'm going to be joined by Preva Ariana Swami, one of our software development managers.

So looking back from last re:Invent to now, we have an incredible lineup of new features that we wanted to talk about. In fact, we had so many releases that it was very hard to choose what to put into this one-hour session, and we had to make some very tough decisions.

So if we don't cover your favorite release or favorite feature, please come talk to us right after the session. To make things slightly easier, we grouped things into four major categories: Discover and Secure, Connect and Share, Scale and Optimize, and Audit and Monitor.

And at the end, I'm very excited to welcome a special guest on stage, Brett Alford from Duke Energy, to share his journey using AWS Lake Formation and the Glue Data Catalog. So having said that, let's dive in.

Let's start with a quick intro into Lake Formation. Lake Formation and the Glue Data Catalog play an incredibly important role in AWS's vision on data governance. So throughout this session, we'll touch on the three highlighted capabilities: data profiling, data catalog, and data security and data sharing.

When we built Lake Formation, as well as the Glue Data Catalog, we did so to solve four major pain points in data management. First off, you're going to have a wide variety of data sources that you need to capture and understand. Secondly, as that number grows, the volume of data is growing as well, and as the volume of data grows, the number of people that need access to that data is certainly going up too.

And a growing number of data sources, a growing number of users and a growing data volume really mean that you need some unified way to manage permissions and data sharing in your organization. And finally, but certainly not least, you need insight into who does what with what data at what time.

So these are the four major capabilities, or pillars, however you want to call them, that we'll touch on throughout this session. We'll start by diving into discovering and securing your data, essentially looking at how you get data into your data catalog and how you can apply permissions on top of that.

After that, we'll look at connecting and data sharing, and we'll look at how AWS partners extend Lake Formation and the Glue Data Catalog even further. We'll look at how you apply your tag-based access controls and how you can scale that even more.

And finally, we'll look at the audit and monitoring section, basically looking at who did what with what data at what time. Before we dive into all of that, though: I realize that where your data lives is incredibly important, the data and all the policies that go with it.

So I'm very happy to say that over the past year, we've added Lake Formation and the Glue Data Catalog to the seven Regions that you see on the screen right now. And that means that both Lake Formation and the Glue Data Catalog are available in all the AWS Regions that exist today.

So the first theme that we want to talk about is Discover and Secure. Lake Formation and the Glue Data Catalog make it easier to discover your data, to automate your schema management, and to manage your permissions in one single place.

Thank you, Leon.

Hi folks. On the same theme of Discover and Secure, I want to give a quick primer on what Amazon DataZone is and how the Glue Data Catalog and Lake Formation are building blocks for it.

As customers evolve on their data-first journey, they are constantly looking for ways to simplify how they discover data sets, share, analyze, visualize and, most importantly, govern them. So they are looking for a more simplified, unified orchestration environment where they can do so, and that's really where Amazon DataZone comes into play.

So with the notion of a business data catalog, Amazon DataZone provides a nice data portal and an orchestration platform where you can create abstractions for your technical assets. You can bring your assets from the Glue Data Catalog, the Redshift catalog or your custom catalog, and the data could be structured, unstructured or semi-structured.

So you have your technical assets, and in this business data catalog, with the concepts of domains and projects, you can now create your own business ontology and nomenclature and expose these assets.

Typical organizations are made up of multiple lines of business, and they have a modern data architecture where every line of business wants to have control over what they define as the data product, how they want to define their governance rules, and how they want to define their business ontology. So we do want to give the data producers, or the owners, that autonomy.

So in that modern data architecture, when you have different lines of business, producers first want to publish their data sets into a nice portal that your subscribers, your downstream consumers, can then subscribe to. And then there is a nice approval workflow that kicks in behind the scenes, and that's really where authorized users on the consuming side can subscribe to these data sets, which are actually based on your technical assets in the Glue Data Catalog.

And then behind the scenes, DataZone uses Lake Formation primitives to transform that approval workflow into authorization grants. So Lake Formation and the Glue Data Catalog really act as building blocks. And you can think of this as one unified, cohesive platform where DataZone gives you this nice abstraction of a project, which is a use-case-based abstraction, and you can expose all of your assets, group them into that encapsulation and start sharing with your set of downstream consumers.

So for folks who have been hearing about DataZone all along and figuring out how to plug it in with the Glue Data Catalog and Lake Formation: think of them as really one cohesive product that is going to simplify your whole orchestration when you think of data discovery and sharing.

Next up, I want to touch on some of the updates on the AWS Glue crawler side. When you speak of discovery, ingestion is the first core phase in your data strategy. And for folks who are not familiar with what Glue crawlers are: crawlers do the heavy lifting of literally crawling your varied data sources; they are able to classify them, you can tag them and so on, and they define a structure and schema for your data sets and bring them back into your technical metastore, which is the Glue Data Catalog.

So think of them as the very first step in your data pipeline. And in crawlers, we have some new additions to the portfolio. The very first update is that you can now bring your own driver version. That means you can have the custom JDBC drivers that you want to use as connectors to talk to your choice of data source, whether it's PostgreSQL, MySQL and so on.

So you have your choice of connectors, and you can bring those JDBC drivers and plug them into your crawlers. Next up, I want to speak a little bit about how modern data is vast and varied, with more and more data getting ingested on a day-to-day basis. The key to analyzing it is really performance: you want your queries to run fast, because time is money and data is valuable.

So you really want to start deriving insights sooner, and partition indexes are key, especially for deeply partitioned data, as in a traditional Hive metastore. When you have such deeply partitioned data, you need efficient partition indexes so that, from your choice of compute engine, you can effectively push down those partition predicates.

That way you're not spending immense time scanning all the partitions and data. And that's where crawlers now automate the creation of those partition indexes when tables are ingested into the Glue Data Catalog, because oftentimes customers either forget to set them, or only realize down the line that their queries are slow and then have to add them as a fast-follow step.
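
As a rough illustration of what the crawler now does for you automatically, here is a minimal boto3 sketch of creating a partition index by hand; the database, table and key names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names; a crawler with this feature would create an
# equivalent index automatically at ingestion time.
glue.create_partition_index(
    DatabaseName="sales_db",
    TableName="web_events",
    PartitionIndex={
        "IndexName": "by_year_month_day",
        # Keys must be a subset of the table's partition keys, in order.
        "Keys": ["year", "month", "day"],
    },
)
```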

So this way crawlers simplify it. And last but not least, crawlers are now fully integrated with Lake Formation. What this means is that prior to this announcement, you would have a crawler job role, and then you would have to go and configure your IAM policies for your S3 data sources, and give the needed access grants and bucket policies to be able to crawl those data sources.

Now, crawlers are fully integrated with Lake Formation, and your crawler job role can work with Lake Formation either in-account or cross-account. So now you have Lake Formation governing not only your regular user and group access but also your crawler job role, giving you a more unified way to control all your access policies.
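
A minimal sketch of what that configuration can look like through the API, with a hypothetical crawler role and S3 path; the LakeFormationConfiguration block is what tells the crawler to use Lake Formation credentials instead of raw S3 bucket and IAM policies.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="orders-crawler",                                   # hypothetical
    Role="arn:aws:iam::111122223333:role/CrawlerJobRole",    # hypothetical
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/orders/"}]},
    # Use Lake Formation credentials to crawl the registered location;
    # AccountId points at the (possibly cross-account) catalog account.
    LakeFormationConfiguration={
        "UseLakeFormationCredentials": True,
        "AccountId": "111122223333",
    },
)
```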

Next up, I want to briefly touch on the support for open table formats. It's really interesting to see a shift in the industry. When data lakes started, customers really enjoyed the disaggregation of compute and storage; you have cheap and deep storage into which you can ingest any kind of modern data, and they loved the comfort and flexibility they got with data lakes.

But now use cases are evolving such that customers want all the goodness of databases: the traditional database goodness of ACID transactions, support for concurrent transactions and snapshot isolation, and support for schema evolution. You want all that traditional database goodness in a data lake world.

And that's really the pitch of these open table formats. You still have your data in S3, in Parquet or your choice of data format, and these open table formats give you a nice metadata layout that provides ACID compliance, transactional capabilities and so on.

So over recent years, Apache Hudi, Apache Iceberg and the Delta Lake format have all become choices to support these modern use cases. And we are happy to say that in the Glue Data Catalog, you can ingest all these table formats, treat them just like any other native Glue Data Catalog table, and mix and match them.

So if you want to run a query against these alongside your traditional Hive tables, they appear just like any other native Glue Data Catalog tables. And once you ingest them into the Glue Data Catalog, you have all the goodness of Lake Formation: you can start applying Lake Formation fine-grained access controls on top of them and start sharing them, using Lake Formation cross-account sharing, tagging and so on.

So these look and feel just like any other Glue Data Catalog table, and then of course you can start consuming them from your choice of compute engine that is capable of supporting these open table formats.
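
As a small illustration, here is a hedged sketch of creating an Iceberg table through Athena's Iceberg DDL and then granting on it with Lake Formation exactly as you would for any other catalog table; all names, locations and the result bucket are hypothetical.

```python
import boto3

athena = boto3.client("athena")
lakeformation = boto3.client("lakeformation")

# Create an Iceberg table; once it lands in the Glue Data Catalog it looks
# like any other table to Lake Formation and the integrated engines.
athena.start_query_execution(
    QueryString="""
        CREATE TABLE sales_db.orders_iceberg (
            order_id string,
            amount double,
            order_date date
        )
        LOCATION 's3://example-bucket/iceberg/orders/'
        TBLPROPERTIES ('table_type' = 'ICEBERG')
    """,
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)

# Grant fine-grained access on it just like on any native table.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders_iceberg"}},
    Permissions=["SELECT"],
)
```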

Next up, I want to touch on a theme that really resonates throughout the session, which is that governance is now a first-class citizen as customers define their data strategy.

They are increasingly asking for ways to understand and implement unified governance. That really needs to be a cultural shift, and it's also important that it does not impede the core business value, because new compliance policies and new governance rules keep coming into play. You want tools and primitives in place such that it's not a heavy lift-and-shift or a migration of your complete workloads and processes; instead, you have an organic way to adopt those compliance rules. And that's really the mission of Lake Formation: we want to give you tools and primitives so that you can slowly, at your own pace, adopt those compliance rules and regulations without impacting your core business proposition.

In that vein, I'm really happy to announce that we have a capability called hybrid access mode for the Glue Data Catalog. To explain what this really means, I'll give you an analogy. Imagine you started your data strategy. You have your workloads; think of your traditional batch workloads or some ETL jobs. You have set up some job roles and configured certain IAM policies, and things are working fine. You have set some coarse-grained permissions on them, and they are ingesting tables, great. Then you have classic analysis use cases coming up from different lines of business, and there are more compliance regulations. You may want to start defining more fine-grained controls, you may want to run sensitivity detection on those data sets, and you want to start defining rules against them.

So that means the same table is now going to be consumed by other use cases where you want to enforce a much finer-grained model. Prior to this launch, you would have had to register the entire resource in Lake Formation mode and almost transform your current policies into equivalent Lake Formation grants, and we heard from customers that this was a huge lift and shift. So instead we are now giving you the flexibility where the same resource can operate in a hybrid mode. What it really means is: I can continue operating my batch jobs and the other things that were configured before, but for my newer use cases I adopt this hybrid model, so I can opt in the newer principals to the finer-grained Lake Formation access mode.

So the same resource can operate in a dual mode, and that gives you flexibility. At your pace, you can decide either to completely move on to a more stringent, finer-grained model, or you could even choose to stay in hybrid mode if it suits your business needs. And as always, because cross-account sharing is one of the core primitives of Lake Formation, we want to ensure that as your use cases evolve, you are also able to share resources in hybrid mode, so that the recipient account admin has the flexibility to decide how they want to further delegate those grants: whether they receive things coarse-grained and then do finer-grained delegation, or whether they choose to adopt their own compliance model.
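
A rough sketch of what opting a location and a specific principal into hybrid mode could look like with boto3; the HybridAccessEnabled flag and the opt-in call reflect my understanding of the feature, and all ARNs and names are hypothetical.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Register the S3 location in hybrid access mode: existing IAM-based access
# keeps working, while opted-in principals get Lake Formation enforcement.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::example-bucket/sales/",
    RoleArn="arn:aws:iam::111122223333:role/LFRegistrationRole",  # hypothetical
    HybridAccessEnabled=True,
)

# Opt one principal into Lake Formation permissions for one database,
# leaving everyone else on the existing coarse-grained IAM policies.
lakeformation.create_lake_formation_opt_in(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/NewAnalyticsRole"},
    Resource={"Database": {"Name": "sales_db"}},
)
```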

So this gives you a nice decentralized posture, where every data owner, data producer and consumer can define how they want to adopt permission management. Next, I want to briefly touch on another capability we added to the fine-grained portfolio, which is permissions on nested data types. As modern data evolves, we often see deeply nested, deeply structured types of data, and customers come up with use cases where they want to control access at that level: hey, I want to exclude only a subfield of a certain deeply structured data set. Prior to this, customers needed to flatten their data sets, which incurred its own overhead. So we are happy to announce fine-grained permissions on these nested data types: you can finally control your column- and row-level permissions on subfields within a complex nested type, and the engines do the subsequent enforcement, ensuring that only the filtered data is shown in the resulting query.
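
As a hedged illustration only: the call below grants column-level access that reaches into a nested struct. The dotted-path syntax for subfields is my assumption about how nested fields are addressed, and every name is hypothetical.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT down to a nested subfield of a struct column without having to
# flatten the data set first. The dotted path ("customer.address.city") is an
# assumption about nested-field addressing; check the documentation for the
# exact syntax your engines support.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/MarketingRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "crm_db",
            "Name": "customers",
            "ColumnNames": ["customer_id", "customer.address.city"],
        }
    },
    Permissions=["SELECT"],
)
```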

So out of curiosity, who in the room here is the administrator of their data lake? Good. In that case, the scenario that I want you to imagine is going to be a lot easier for you. Imagine that you are the administrator of a data lake and you have a bunch of different tables: we have purple, pink and orange tables. At the same time, you have a bunch of different groups of people: the purple, pink and orange groups. And for all of them, you create roles. So the purple group gets access to the purple IAM role to access the purple tables, and the same for the pink and orange ones. And that all sounds really good. I mean, that's what you would do today, right?

And then questions start to come. Questions from power users: I need access to both the orange and the purple data, so why can't I do that with one single IAM role? Why do I need to switch back and forth all the time between different roles? Or, as you're imagining this scenario yourselves: how many roles do I actually need to create to match the groups that exist in my identity provider? Questions from identity provider administrators: how do we map all those users and groups to actual IAM principals? And last but certainly not least, questions from auditors: I see that someone assumed this role and accessed the data, but who was that person, and what did they actually do with the data? Those are all incredibly important questions. And that is why I'm happy to announce that Lake Formation is now integrated with AWS IAM Identity Center.

That means that you can use the identities coming from your existing identity provider, whether that is Okta, Azure AD or any of the others that IAM Identity Center integrates with, and use those identities within Lake Formation to grant fine-grained access controls. So I can grant access permissions to leon@anycompany.com in the same way that I would today to any IAM principal. And since I'm granting access to leon@anycompany.com, all of the underlying services will understand that identity as well. That means that from an auditability point of view, I have a consistent view of who accessed the data at what point.

Looking at the user interface of Lake Formation: whether you do this through the UI, the CLI or the APIs doesn't really matter, your workflow doesn't change. It's the same way that you would grant permissions today to your IAM principals, except those identities now come from AWS IAM Identity Center. And you can obviously use all the features that are supported by Lake Formation. To get started, you simply go to the Grant Permissions tab in Lake Formation and use the new option titled IAM Identity Center.
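
A minimal sketch of that grant through the API, assuming the identity comes from IAM Identity Center; the principal identifier format shown is my assumption, so treat it as illustrative rather than exact.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant directly to a workforce identity from IAM Identity Center instead of
# an IAM role. The identifier format below is an assumption; in practice you
# would use the user or group identifier that Identity Center surfaces.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:identitystore:::user/leon@anycompany.com"  # hypothetical
    },
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
)
```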

Now, I want to walk you through a quick example to hopefully make this come alive for you. This is my setup: I have Amazon Redshift, I have EMR, I have Amazon QuickSight, and at the far end my data is stored in Amazon S3. And I use QuickSight and CloudTrail on top of all of that. Now, my users log in to EMR Studio using the identity that they actually are. So they're not logging in with an IAM user; they're logging in as leon@anycompany.com.

Now, EMR Studio, on behalf of leon@anycompany.com, is going to need data. So it uses EMR, and EMR, on behalf of that user, reaches out to Lake Formation and the Glue Data Catalog to get credentials to access S3 and the permissions that it needs to apply. Assuming that everything goes OK and you actually have permissions, EMR gets the credentials back, and on behalf of the user EMR then accesses Amazon S3 to get the data.

Now, I've been saying "on behalf of" quite a bit in the last few sentences, and the reason for that is that it actually shows up within your audit trail. So from an auditability perspective, when you think about those four major questions we had on the previous slide, you can actually see who did what with which data in which service at what time, making it incredibly straightforward to connect all those service calls together.

So now that we've discovered the data and secured the data, let's dive into how you connect all those different pieces together and how you share them throughout your company. That second theme, Connect and Share, is all about letting your users gain insight much more quickly.

In the theme of connecting and sharing data, the first and foremost focus is on building primitives. We want the AWS Glue Data Catalog to become the core technical metadata store that is able to talk to very different sources, so you can pull in the data and then apply all of your Lake Formation goodness on top.

In order to achieve that, one of the core primitives we are focusing on is federation: we want to be able to federate to different sources and make your Glue Data Catalog a world-class technical data catalog. And as I was alluding to earlier, that's also going to become the core for your DataZone orchestration platform. You have all of your technical assets in the Glue Data Catalog, and they can later be exposed in further abstractions as you move up the stack.

Along those lines, the first thing I want to quickly highlight is Lake Formation managed Redshift data shares. At last year's re:Invent we announced the preview, and earlier this year we went live with it. A quick primer: customers increasingly want their data warehouse and data lake journeys to come closer together, and they want a more unified way of governing the policies, a single place to manage both their data lakes and their data warehouses.

In that spirit, we introduced Lake Formation managed Redshift data shares. What this really means is: you have data in your Redshift data catalog, you create an encapsulation called a data share and put your Redshift tables and views in it, and then, through the federation primitive, you expose those as regular native Glue Data Catalog tables. And as I've said earlier, once the assets are in the Glue Data Catalog, you can start applying Lake Formation permissions on top of them.
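
On the producer side, this is roughly what that flow can look like, sketched here as SQL issued through the Redshift Data API; the share, schema, workgroup and account values are hypothetical, and the VIA DATA CATALOG clause is what hands the share over to Lake Formation governance.

```python
import boto3

redshift_data = boto3.client("redshift-data")

statements = [
    "CREATE DATASHARE sales_share;",
    "ALTER DATASHARE sales_share ADD SCHEMA public;",
    "ALTER DATASHARE sales_share ADD TABLE public.daily_sales;",
    # Authorize the share to an account *via the Glue Data Catalog*, so the
    # tables surface as catalog tables governed by Lake Formation.
    "GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '111122223333' VIA DATA CATALOG;",
]

for sql in statements:
    redshift_data.execute_statement(
        WorkgroupName="sales-wg",   # hypothetical Redshift Serverless workgroup
        Database="dev",
        Sql=sql,
    )
```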

And then your consuming environment and engines can start discovering these assets and running queries against them just like any other table. So you can mix and match them with other Hive tables and other warehouse tables from Redshift. It gives you a much more unified, consolidated place to discover your assets and also define fine-grained permissions on top. And as an extension to this, we added a new capability: cross-Region support.

What that really means is that in a modern data mesh architecture, we realize you do have data sources in different Regions. You may want your central Lake Formation account, where you control governance and audit, in a different Region, and then you may have different downstream consumers and lines of business that are globally distributed as well.

So in order to support this, we made this capability fully cross-Region: your consumer, your governance account and your producer data share environment can all be in different Regions.

Next up, building on the federation primitive, we went live with Hive metastore federation. That means that, again, you may have very different data sources: your choice of on-premises Hive metastore, or any other compatible third-party metastore. We can connect with them and dynamically pull the data sets into the Glue Data Catalog as native, first-class tables.

So this is again building on top of the federation primitive, and once you have that, you can start sharing those data sets using the Lake Formation primitives and start applying fine-grained permissions. Along the same lines, I am happy to announce that we launched, this Sunday, CloudTrail Lake federation.

For folks who are not familiar with what AWS CloudTrail Lake is: CloudTrail Lake is a managed data lake that lets you store, secure, analyze and visualize all your security and audit logs. Now imagine you have use cases where you say: OK, I have all these security logs, but for some operational or IT use case I want to run queries on them or join them with other tables that are not in CloudTrail Lake, that are, say, in your regular Glue Data Catalog.

Now, with the federation primitive, you can go to your CloudTrail Lake and just say "enable federation". That way, your assets in CloudTrail Lake are exposed as native, first-class tables in the Glue Data Catalog. Behind the scenes, we connect to those data sources and pull them in, and they look and feel like any other native table.

So that is the powerful capability we have with CloudTrail Lake federation. And then you can start querying them through your choice of engine, and Athena is the first one that we have launched for this re:Invent.
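
Once federation is enabled, querying it from Athena is just regular SQL; a hedged sketch, where the federated database, event data store table and the application table are all hypothetical names.

```python
import boto3

athena = boto3.client("athena")

# Join CloudTrail Lake events (surfaced as a federated catalog table) with a
# regular Glue Data Catalog table, e.g. to enrich API activity with team info.
athena.start_query_execution(
    QueryString="""
        SELECT e.eventname, e.useridentity.arn AS principal, t.team_name
        FROM aws_cloudtrail_events_db.event_data_store_table AS e
        JOIN ops_db.principal_team_mapping AS t
          ON e.useridentity.arn = t.principal_arn
        WHERE e.eventtime > date_add('day', -7, now())
        LIMIT 100
    """,
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```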

Next up, I want to talk about views. As I was alluding to earlier in the context of open table formats, customers are always seeking all the goodness of databases in a data lake world, and this is again a common theme. Views are an age-old database concept, but a very powerful one: you have all your base tables that you want to join and run fancy queries on, and views give you a nice abstraction over them.

So in the data lake world, in the Glue Data Catalog, we want to bring that goodness to you. We are now offering a first-class primitive called Glue Data Catalog views. It's the traditional concept: you create a view, which is nothing but a SQL definition of how you want the view to be built, and it could join multiple base tables.

And the point to note here is that those base tables could be native Glue Data Catalog tables, federated tables or cross-account tables; once they're in your catalog, they look and feel like any other table. So you first create a view by saying, this is the view that I want to create, and it appears as a first-class object in the Glue Data Catalog.

Next up: we spoke to many customers and realized that as your choice of compute and your use cases evolve, there are nuances and differences across engines. When you think of SQL, there are dialect and semantic differences between Spark, Redshift and Presto or Trino.

So we want to embrace and empower your use case in a way that you don't have to proliferate objects in your catalog; that is, you don't have to go and define a completely new object for every SQL definition because of your choice of compute and its differences.

Instead, we are offering a capability called multi-dialect views. What this really means is that it's still a single view object, but you can now say: I want this view definition for the Spark engine, this one for Redshift, and this one for my Trino engine. So you can segregate the dialects and SQL definitions while the intent stays the same: at the end of the day, it is still intended to do the same thing, but you are embracing the semantic and syntax differences that are nuanced across dialects.

So you have that multi-dialect view definition, and once you have it, a view is a nice abstraction that looks and feels just like any other table. You can start querying against it and share it with your cross accounts and other consumers and so on.

Another interesting posture here, which is not what traditional databases support: typically, when you share a view, your consumers also need access to the underlying base tables, because at the end of the day, when you materialize the view, you need to know what those base tables are. But we are giving you a nice posture called definer views.

What this really means is that only the data owner or producer who is creating or orchestrating the view needs access to those base tables. Once you define the view, you can start sharing it with your various lines of business and other consumers without having to share the base tables.

So the consumer running a query against this view does not need to know of, or have access to, those underlying base tables. We believe that's a powerful concept, hearing various customer use cases. You can create this nice abstraction without having to worry about sharing all of your assets with your downstream consumers.

And this is a nice example to illustrate that: you have your base tables, and you may have some consent or entitlement tables that you want to join against, but the end result, the nice slice and dice of that data, is what you want to share, without having to share the base tables.
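
A hedged sketch of what creating such a view can look like through Athena's SQL surface; the PROTECTED MULTI DIALECT and SECURITY DEFINER syntax reflects my understanding of the feature, and the table and view names are made up. Other dialects (for example a Spark or Redshift definition of the same view) would be added as additional representations of the same catalog object.

```python
import boto3

athena = boto3.client("athena")

# Create a Glue Data Catalog (definer) view that joins a base table with a
# consent table, so consumers can query the result without access to either.
athena.start_query_execution(
    QueryString="""
        CREATE PROTECTED MULTI DIALECT VIEW sales_db.orders_by_region
        SECURITY DEFINER
        AS
        SELECT o.region, sum(o.amount) AS total_amount
        FROM sales_db.orders o
        JOIN sales_db.consent c ON o.customer_id = c.customer_id
        WHERE c.opted_in = true
        GROUP BY o.region
    """,
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```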

So at AWS, we build services to let our customers choose the right tool for the job, and especially from a data governance perspective, it's really important that you get to choose the tool that you as a user are most comfortable with. So we built our services to be straightforward to integrate with and to extend, whether it is an existing business metadata catalog, existing data governance tools or existing third-party query engines.

We want to make sure that they can all connect to AWS Lake Formation and the Glue Data Catalog in the same way that engines like Amazon Athena or Amazon EMR can. Now, over the course of the existence of the Glue Data Catalog and Lake Formation, we've been very happy that over 20 partners have already integrated.

And today, I'm very excited to announce three new additions to that partner list, starting with Collibra. With Collibra Protect as the service that integrates with the Glue Data Catalog and Lake Formation, they allow you to have one single overview, as a business metadata catalog, of your data and of how you as a user want to protect it.

So you set your permissions within Collibra Protect, Collibra pushes them into Lake Formation, and Lake Formation makes sure that those policies are enforced at the right moment. It uses all the goodness of Lake Formation, like tag-based access control, so anything that you would expect from that integration is in there.

Second is Dremio. Dremio is a query engine; out of curiosity, is anyone using Dremio here? Good, I see some hands coming up. Dremio integrated with Lake Formation and the Glue Data Catalog to get permissions at runtime. So when you run your queries within Dremio, Dremio uses the Lake Formation APIs and the Glue Data Catalog APIs to get the permissions and, at runtime, enforce those permissions as you run your query.

And last but not least, Privacera. Privacera built not one but two integrations with Lake Formation. The first, a push mode, is essentially what you'd expect: you define your permissions in Privacera's UI, and Privacera makes sure that, for the Lake Formation integrated engines, Lake Formation gets told what it needs to enforce at what time.

The second one flips that model around and is called the pull mode. As you'd expect from the name, it automatically pulls the permissions from Lake Formation at certain time intervals and transforms them into something that Privacera can then enforce on all the engines that Privacera is connected with.

So whether it is an existing metadata catalog, an existing query engine or an existing data governance tool: as I said, we want to make sure that you as a customer have the best tools for the job.

Now, as Preva mentioned, cross-Region table access is incredibly important because data and tables exist in different places. Sometimes that means different services, sometimes different systems, and sometimes different Regions. In order to make sure that all those query engines we talked about can access the data where the data is, we've enabled cross-Region table access.

That really means that if you want to run your query in Amazon Athena, Athena can use resource links to get to the data and the metadata without having to copy all of that over into the Region where you are. So you have the data in your catalog, and you have granted access to the people who actually need it.

Now, let's talk about scaling and optimizing your data and your data governance. This theme is all about doing things at scale. The first thing I want to talk about, something I'm very excited to share, is the general availability of AWS Glue Data Catalog statistics.

If you're familiar with cost-based optimizers... who here is familiar with cost-based optimizers? OK, I see a few hands going up, that's great. A cost-based optimizer is essentially part of the query planner in analytics engines. Based on certain statistics and certain factors, it computes the best query plan for the engine to use, which is especially useful when your query has multiple joins and is a very complex SQL query.

So that CBO, that cost-based optimizer, automatically calculates, based on the statistics it has, the best plan to execute. And that's why we're happy to share that Glue Data Catalog statistics is GA and integrated with engines like EMR, Redshift and Athena.

Without that native support, you would have to do all the undifferentiated heavy lifting of getting statistics in there yourself: you'd have to manage source code, you'd have to manage pipelines, and that's just not something that you want to be doing.

So with this feature, you can start your statistics collection jobs right from the Glue Data Catalog UI or from the integrated engines.
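
For reference, this is roughly what kicking off a collection run looks like via the API; the role and table names are hypothetical, and the integrated engines pick the resulting statistics up from the catalog automatically.

```python
import boto3

glue = boto3.client("glue")

# Start a column statistics collection run for a catalog table. The resulting
# statistics feed the cost-based optimizers in engines like Athena, Redshift
# and EMR Spark.
glue.start_column_statistics_task_run(
    DatabaseName="sales_db",
    TableName="orders",
    Role="arn:aws:iam::111122223333:role/GlueStatsRole",  # hypothetical
    SampleSize=25.0,  # percentage of rows to sample
)
```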

Speaking of Scale and Optimize, I want to give a quick primer on Lake Formation tag-based access control. When you really want to scale your permissions, you have a set of resources that are constantly being ingested into your data pipeline, and that's where tags simplify the posture. Think of tags as a key-value pair, or a key with multiple values, where for a set of resources you first define the ontology: you classify and say, this is my use case, these are the tags I want to define, and then subsequently you assign those tags to resources.

So you do the classification, saying these are the attachments I want to make, and then you start making the grants in terms of tag policy grants: you grant to your principals, organizations and organizational units on the tags. That way, as new resources are ingested, you don't have to proliferate your grants; you don't have to go again and say, I want to grant this principal access to this table. Instead, your tags are part of your ingestion pipeline, they do the classification of the resources, and you get efficient policy management.
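
A minimal end-to-end sketch of that flow with boto3, using hypothetical tag keys, values and principals: define the tag, attach it to a database, then grant on the tag expression rather than on individual tables.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# 1. Define the ontology: an LF-Tag and its allowed values.
lakeformation.create_lf_tag(TagKey="domain", TagValues=["sales", "finance"])

# 2. Classify a resource by attaching the tag (tables under it inherit it).
lakeformation.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "sales_db"}},
    LFTags=[{"TagKey": "domain", "TagValues": ["sales"]}],
)

# 3. Grant on the tag expression, not on individual tables, so newly
#    ingested resources carrying domain=sales are covered automatically.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/SalesAnalysts"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "domain", "TagValues": ["sales"]}],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)
```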

Up until now, this used to be a data lake admin-only operation: the creation of the tags, actually attaching them to resources, and making the tag policy grants. As data architectures evolve, we are hearing themes around decentralized architectures: if you really talk about a data mesh, every data owner or producer in some form or fashion needs to wear that admin hat.

So that's where we have now introduced a new capability called Lake Formation tag delegation, where a data lake admin in your governance account can delegate the task of defining, creating and managing those tags and tag policy grants to another persona. It could be a data steward, or another admin-like persona in your data producer accounts.

This way you get that autonomy in a decentralized architecture, so every data producer is now defining the ontology of how they want to structure and share their data sets.

Next up, in terms of optimization, I want to touch on the general availability of a new capability we added to the Glue Data Catalog: automated compaction for Apache Iceberg tables. As I was alluding to earlier, the open table formats are evolving, and the small-file problem is real. With modern data, a lot of small files get generated from your tweets, images, posts and whatnot, and on top of that you have record-level updates where you're only changing some bytes or some portion of a record, so they are all constantly generating more files. At the end of the day, for your analysis use case you want to be optimal; query engines should not be spending tons of time just scanning all those small files. That's where, in a classic database world, you have bin packing.

Now, in the Glue Data Catalog, we are simplifying that and giving you an automated compaction capability. All you have to do is enable compaction, and behind the scenes we use an efficient approach: we detect the rate of small-file generation and file change, and then automate the compaction in the background so that your small files are constantly compacted efficiently, in a way that your front-end jobs and queries don't get impeded.
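
Enabling it is essentially a single call; here is a hedged boto3 sketch assuming the Glue table optimizer API, with hypothetical names and role.

```python
import boto3

glue = boto3.client("glue")

# Turn on managed compaction for an Iceberg table; Glue then monitors small
# file generation and compacts in the background without blocking queries.
glue.create_table_optimizer(
    CatalogId="111122223333",
    DatabaseName="sales_db",
    TableName="orders_iceberg",
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::111122223333:role/GlueCompactionRole",  # hypothetical
        "enabled": True,
    },
)
```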

The next theme I want to briefly touch on is the audit and monitor capability. As Leon was highlighting with the Identity Center integration and the other capabilities we are adding, we want to make sure you have visibility and insight, through your CloudTrail logs and other places, into who has access to what, because that is important. If we really want the Glue Data Catalog to be the technical metastore for all your assets and Lake Formation to control all the grants, then we want to make sure you have better primitives in terms of auditability and reporting.

In that vein: up until now, as I said, there is this persona called the data lake admin, which is a powerful role in terms of all the mutable actions it can take; it controls how the data lake is orchestrated, all the grants that are made, and so on. However, in a decentralized model, we are hearing that it would be good to have an auditor kind of persona, which you can think of as a read-only admin. They get the privilege to look into all the resources and all the grants that were made, without having the full power of making mutable changes. So we have added this capability as a read-only admin role with a managed policy, so that this persona can be your auditor role, able to monitor the grants.
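
A rough sketch of designating such an auditor principal, assuming a read-only admins list in the data lake settings; the role ARN is hypothetical, and the existing settings are read first so they are preserved.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Read the current settings so we only append the auditor principal.
settings = lakeformation.get_data_lake_settings()["DataLakeSettings"]

read_only_admins = settings.get("ReadOnlyAdmins", [])
read_only_admins.append(
    {"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/LakeAuditorRole"}
)
settings["ReadOnlyAdmins"] = read_only_admins

# Write the settings back; the auditor can now describe resources and grants
# everywhere without being able to make mutable changes.
lakeformation.put_data_lake_settings(DataLakeSettings=settings)
```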

So with that said, I know that this was a huge lineup of things. I'm really excited to share the stage with Brett Alford, an enterprise architect from Duke Energy, who is here to share their data lake journey and how Lake Formation has helped them along the way.

Thank you, Leon. Perfect. So allow me a moment to share a few fast facts about Duke Energy. Duke Energy is a Fortune 150 company headquartered in Charlotte, North Carolina, and is one of America's largest energy holding companies. It's executing an aggressive clean energy transition to achieve the goals of net-zero methane emissions from our natural gas business and net-zero carbon emissions from our electric generation by 2050. This transition requires significant investments in emerging clean energy technology, diverse generation sources and storage, a smarter grid and a modern stakeholder experience. This will further require a transformative IT and data landscape to be both an enabler and a force multiplier. Such new goals and priorities, coupled with modern cloud compute, data management, and leaps in AI and advanced analytics, highlight the opportunity for change.

Challenges include a rapid proliferation of new data sources; more unstructured and semi-structured data with an increasing opportunity to mine; data volume and velocity increases; data quality issues exposed by new use-case experimentation; limitations of centralization and scale of architecture; unsustainable growth in private data center capacity and cost; and institutional knowledge at the business edge needed to maximize understanding and value extraction.

So business and IT priorities demand a new approach to deliver outcomes with data. High value is placed on delivery speed, reducing time to value, simplifying data access and resource provisioning, building and maintaining reliability into our products, and right-sizing governance, with data privacy and protection as a base expectation.

We use the data mesh approach and data products, internally branded as a data fabric, to build better data-driven outcomes. The approach also incorporates elements of a data lakehouse architecture, with Lake Formation as a core component. The approach further biases towards decentralization in all but common tools and capabilities, implemented with a self-service model and federated governance.

So data products have clearly defined owners and a durable product team or domain-aligned focus on value; they are built by technologists and business SMEs, and they integrate with the technical data catalog and a marketplace.

So with all this talk of data mesh and data products, how does Lake Formation help? It does so by providing a growing set of data management capabilities that enable cloud-scale data and analytics, including a governance function, incorporating data catalog integration, and managing structured data over simple or complex data types.

Tag-based access control simplifies data entitlements at scale. Open table formats integrate a wide variety of analytic products and services from third parties, open source and AWS. These formats enhance the data lake with database-like behaviors including transactions, upserts, indexes, concurrency and time travel, all while remaining independent of data processing and compute services.

So now let's get to what Duke Energy is doing. We start with a product account with all the requisite services to ingest data and then make that data available to other products and other accounts. So you have all those services like S3, your ingestion services, your logging and monitoring, your Lake Formation and your Glue catalog. And then this product team maintains a manifest, and the manifest describes the tables, the database and the tags that are to be used, as well as how those tags will be applied for permissions or entitlements by the actual product team.

Next, we established a central governance account built upon a key set of services like Lake Formation, the Glue catalog, KMS, EventBridge and IAM. We use Terraform Enterprise, and Terraform Enterprise executes a set of modules that take the manifest file from the producer product repository and apply the metadata, the tags and the entitlements in the central governance account, create the resource links, share the tags back to the producer product account, and then set up entitlements there as well.

All changes to the data catalog resources are logged. We use the default event bus via EventBridge to catch relevant data catalog events and transform them with a Lambda function into an easily interpreted format that is then sent to a data catalog account. In the data catalog account, an event bus receives the central governance account's data catalog events, a rule is set up to process those, and a Lambda function then calls the APIs for the enterprise business data catalog. This keeps the business data catalog and data marketplace in sync with the technical data catalog.
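
To make that event flow concrete, here is a minimal sketch (not Duke Energy's actual Terraform) of an EventBridge rule that catches Glue Data Catalog changes and targets a Lambda function; the rule name and function ARN are hypothetical, while the event source and detail types are the standard ones Glue emits.

```python
import boto3

events = boto3.client("events")

# Catch Glue Data Catalog changes on the default event bus and forward them
# to a Lambda that reshapes them for the business data catalog account.
events.put_rule(
    Name="glue-catalog-changes",   # hypothetical
    EventPattern="""{
        "source": ["aws.glue"],
        "detail-type": [
            "Glue Data Catalog Database State Change",
            "Glue Data Catalog Table State Change"
        ]
    }""",
    State="ENABLED",
)

events.put_targets(
    Rule="glue-catalog-changes",
    Targets=[{
        "Id": "catalog-sync-lambda",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:catalog-sync",  # hypothetical
    }],
)
```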

Next, we have a consumer data product account, which may be provisioned with a minimum set of services to integrate with the data mesh: Lake Formation, Glue, KMS, IAM, CloudWatch, plus any specialized services required for the use case. To reduce data copies, the consumers access the data in place. A stakeholder for the consumer data product makes a request to the producer product account, and the producer product account stakeholder maintains a manifest file specific to each consumer account, which describes the catalog resources to be shared and the entitlements that are required in that consumer account.

A consumer Terraform module was created and used by this process to ensure that there is a role in the consumer account, so that the subsequent Terraform processes have the necessary permissions to create the resources in the consumer account. And lastly, there's a producer-consumer module that creates the entities in the central governance account; the principal assigns the tags, shares those tags back to the consumer account, and sets up the entitlements in the consumer account.

So then finally, in the consumer account, you will have catalog entries in your local account that allow you to see the resources that exist in the producer account.

So where are we in this journey of data mesh? We've delivered multiple products and re-architected from legacy on-premises platforms. We're actively in development on roughly 50 different data products, as we speak, across multiple domains. We're supporting a data science innovation hub platform with many experiments actively underway. We have an MLOps pipeline we call AI Central that is leveraging this architecture, and we continue to advance data product thinking within the company.

So where do we go from here? What's next? Automation, automation, automation: automating the account creation, automating roles and policies, automating data pipelines. Next, we want to extend the data fabric access bridge, which is our internal brand for the central governance account model, to support Amazon Redshift data shares. We also want to create a bidirectional integration with our business data catalog, meaning we want the customer experience, when they request access to data, to start in the business data catalog or the data marketplace and to end, on the back end, with the necessary permissions being implemented.

Then we want to experiment with all these great new features that have been created for us in Lake Formation that you heard about earlier: Lake Formation managed Redshift data shares, the integration with IAM Identity Center, the Glue Data Catalog views, hybrid access mode, CloudTrail Lake federation. We also want to take a look at Amazon DataZone: could DataZone do all that we're doing in our custom implementation, or do we want to continue to pursue our own custom implementation? And then let's explore the cross-Availability Zone and cross-Region considerations to have a highly available, reliable and resilient platform.

And with that, I will turn it back over to Leon to close this out. Thank you, Brett, really appreciate it.

So for our Q&A, considering time, we'll actually do that outside of the hall as we exit; Preva, Brett and myself will be there to answer any questions that you may have. And with that, I really want to thank you for being here today, for being at re:Invent. If you are still looking for some amazing sessions, I would check out some of the ones that are on the slide right now.

During this session, we've taken you through the journey of AWS Lake Formation. We've talked about discovering and securing data, we've covered connecting and sharing data, we've covered scaling and optimizing, and the audit and monitoring capabilities as well. And obviously, we just heard from Brett the amazing story of Duke Energy and how they are using the Glue Data Catalog and AWS Lake Formation in their data journey.

Now, at AWS, as you all know, we are a data-driven company. So we would absolutely like to ask you to fill out the session survey that's going to pop up in your mobile app, because we want to make sure that we provide the best content possible to you. Also, if you haven't done it yet, I strongly recommend that you find out who your AWS data superhero is. Mine is Wire Weaver, which is why it's on the slide right now.

So with that, I thank you all very much for being here. I hope you have an amazing day, I hope you have an amazing re:Invent. Thank you.
