How to build a business catalog with Amazon DataZone

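This article introduces Amazon DataZone, a new data management service that lets organizations build a shareable metadata layer so employees can discover, use, and analyze data. Through business glossaries and standardized metadata forms, DataZone promotes data consistency and discoverability, including data migration and update management supported by automation tools.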

Hey, everyone, welcome. I'm getting hungry looking at all of you. Just kidding. It's great to see you all. Talking to many of you throughout the day, one thing has become very clear: we have figured out where and how to store the data, and the real challenge is how to make that data available to everyone across the organization. If this resonates with you, can I get a show of hands to see if that's what you're trying to solve? OK. Awesome, great.

So in this session we are going to talk about Amazon DataZone, a new data management service that was announced at re:Invent last year and that we made generally available in October. What DataZone does is let you build an active metadata layer, so that everyone across the organization can share data if they choose to, and everyone else can search for, find, understand, and subscribe to that data for their analysis.

I am Priya Th Dai, senior project manager with Amazon DataZone, and with me is Leo, who will show a demo of how it works.

Alright. To briefly cover what DataZone is, I'm going to introduce the capabilities it offers, and then we'll see a quick demo of how it actually works.

So what do you need when you want to build a catalog? Catalogs have been around for a while. They're nothing new, but they have evolved over time: a catalog now has to be extensible and scalable to meet the business needs of the modern data stack.

First, you need organizational domains, so you know who owns the data, where it's coming from, and whom to get access from. Next is how you curate the metadata so that everyone understands the data. And that means all users, technical or non-technical. Usually, looking at a technical catalog, you understand the rows and the columns and the technical names, but it doesn't give a business user much information about what the data actually contains.

So you need metadata curation to build context around the data that you want to share with others in the organization. To do that, there are two foundational blocks. One is the business glossary, which helps you translate technical terms into business terms. The other is metadata forms, which give you a scalable way to include more information in your asset.

That way, anyone browsing and searching can understand the asset better. Think of when you search for a product on amazon.com: you look at the manufacturing details, you look at the product dimensions. Those are like the metadata forms that you can add to the data that you want to share across the organization.

So how do you make data discoverable? First, you ingest the data, bringing it into DataZone so it can be cataloged, and then you enrich it with business information so that everyone looking at it understands what it contains. Then you have the consumers, who can go to DataZone, subscribe to the data, and take it forward with a tool of their choice to run analytics, build a dashboard, build an ML model, or whatever they want to do.

Eventually these consumers can become producers themselves and catalog the assets they are building, so those become discoverable for other users to build on top of.

So how do you get started? How do you build these building blocks so the data becomes discoverable? The first is assets. How do you bring assets into DataZone? DataZone is built on top of Glue. By leveraging the strength of Glue, we are able to take the technical metadata that Glue finds from the sources, or from Redshift, and bring it into DataZone.

With these assets, you can build context around them, enriching and curating them so that anyone can understand the data and see whether it fits what they're looking for.

A business glossary is like a dictionary of terms. You want to standardize terms across the organization, because "account" can mean one thing to a marketing team and something else to a sales team. With the use of business glossaries, you can make sure everybody is on the same page and understands what an account actually means and what it brings to the table.
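To make the "account" example concrete, here is a minimal Python sketch of glossaries as scoped dictionaries of terms. The `Glossary` and `Term` classes and the two sample definitions are hypothetical illustrations written for this transcript; they are not DataZone's data model or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Term:
    name: str
    description: str


@dataclass
class Glossary:
    name: str
    terms: dict[str, Term] = field(default_factory=dict)

    def add_term(self, name: str, description: str) -> Term:
        # Term names are matched case-insensitively within one glossary.
        term = Term(name, description)
        self.terms[name.lower()] = term
        return term

    def lookup(self, name: str) -> Term | None:
        return self.terms.get(name.lower())


# Made-up definitions: the same word carries a different meaning per team.
marketing = Glossary("Marketing")
marketing.add_term("Account", "A prospect or customer targeted by a campaign.")
sales = Glossary("Sales")
sales.add_term("Account", "A company with an assigned sales owner.")

for glossary in (marketing, sales):
    term = glossary.lookup("account")
    print(f"{glossary.name}: {term.description}")
```

Keeping each glossary as its own namespace is one simple way to let two teams define the same word differently while still making both definitions visible to everyone.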

And third is metadata forms, which are nothing but key-value-pair forms that you can use to enrich your assets, build context around them, and bring consistency to them, so that everyone can trust the data, know where it comes from, and understand whom it belongs to.

That way they can confidently use the data for their analysis. And everything you use to build these layers is searchable. This is how DataZone powers search: when end users search, they can use these levers to filter down to the right results and take those assets forward for their analysis.
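As a rough illustration of using glossary terms and form fields as search levers, the sketch below filters a tiny in-memory catalog. The `Asset` class, the `search` function, and the sample assets are all hypothetical; DataZone's actual search is a managed service and is not implemented this way.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    description: str
    glossary_terms: set[str] = field(default_factory=set)
    form_fields: dict[str, str] = field(default_factory=dict)


def search(assets: list[Asset], text: str = "", term: str | None = None,
           **field_filters: str) -> list[Asset]:
    """Return assets matching free text, a glossary term, and form-field filters."""
    needle = text.lower()
    results = []
    for asset in assets:
        haystack = f"{asset.name} {asset.description}".lower()
        if needle and needle not in haystack:
            continue  # free-text filter
        if term is not None and term not in asset.glossary_terms:
            continue  # glossary-term filter
        if any(asset.form_fields.get(k) != v for k, v in field_filters.items()):
            continue  # metadata-form field filters
        results.append(asset)
    return results


# Made-up sample catalog for illustration only.
catalog = [
    Asset("orders", "Daily order lines", {"Account"}, {"owner": "sales"}),
    Asset("campaign_clicks", "Ad click events", {"Account"}, {"owner": "marketing"}),
    Asset("order_returns", "Returned order lines", set(), {"owner": "sales"}),
]

hits = search(catalog, text="order", term="Account", owner="sales")
print([asset.name for asset in hits])
```

Each filter narrows the result set independently, which mirrors the idea that every piece of curated metadata becomes another way to find the asset.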

So what are the foundational models? First is the glossary. You build the glossary, and terms belong to a glossary. You can have multiple glossaries in your catalog, for example a marketing glossary and a sales glossary, and together they help you build that dictionary of terms that everyone across the organization understands. Then there is the metadata form that you want to standardize across all of your assets.

A form can be created by your admin or a data steward who says, "OK, this is the consistency I want to bring for everyone across the organization," and it has to be part of the asset that you want to publish. The producer uses these forms as part of the publication process: they attach the form to the asset and then publish it, which means it becomes discoverable for everyone across the organization.

Now let's look at how you're going to enrich your metadata. First, on the right side is what a data asset looks like. It is like a wiki page for your asset. It has all the details of your asset: the name, the description, and all the technical metadata that was brought from the source, so you can understand it better. You can also use business glossaries to classify or tag it so that you can search based on those glossary terms, and metadata forms can become part of the asset to standardize the asset details.

Like I said, if you go to amazon.com and look at a product page, you can find the manufacturing details, the product dimensions, and a lot of other details that belong to the product. We want to provide the same experience for all the assets that you catalog in DataZone. And all of this sounds very manual, doesn't it?

So how do we automate it? How do we make it simple for every user? First is the ingestion of data. DataZone provides automated jobs that bring in data from these different sources. You can schedule them or run them once. If you have a scheduled job and there's a schema change, DataZone sends a notification to the publishers to say, "Hey, the schema of this source that you brought in has changed. Do you want to notify the subscribers or stop their usage?" That way it doesn't break any downstream activities, everyone is notified, and there is no impact even when the data has changed.
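The schema-change check described here can be pictured as a comparison between the columns seen on two consecutive scheduled runs. The following is a minimal sketch of that idea only: `diff_schema`, `is_breaking`, the sample columns, and the rule that removed or retyped columns count as breaking are all assumptions made for illustration, not DataZone's actual logic.

```python
from __future__ import annotations


def diff_schema(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {column_name: type} mappings from consecutive runs."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(c for c in old.keys() & new.keys() if old[c] != new[c]),
    }


def is_breaking(changes: dict[str, list[str]]) -> bool:
    # Assumption: new columns are harmless; removed or retyped ones may
    # break downstream consumers.
    return bool(changes["removed"] or changes["retyped"])


# Made-up schemas from a previous and a current ingestion run.
previous = {"trip_id": "string", "distance": "int", "city": "string"}
current = {"trip_id": "string", "distance": "double", "region": "string"}

changes = diff_schema(previous, current)
if any(changes.values()):
    print("Schema changed:", changes)
    if is_breaking(changes):
        print("Ask the publisher whether to notify subscribers or pause usage.")
```

Separating detection (`diff_schema`) from the decision (`is_breaking`) keeps the publisher in control of what happens to subscribers, which is the point made in the talk.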

The second level of automation is how you curate the data, and this is an area where DataZone is investing a lot. The first thing we introduced at GA is automated name generation, which uses machine learning to generate descriptions for the assets and the columns so you don't have to do that manually. With tens or hundreds of assets you can do it by hand easily, but think of millions of assets. How are you going to build the descriptions?

And one exciting thing that's coming tomorrow in Adam's keynote is another new capability in this space. So do take the time to watch what's new in Amazon DataZone and see how it fits your business use case.

Assets usually refer to traditional structured data with rows and columns, but that is not the case anymore. With DataZone, you can catalog assets that go beyond structured tables. An asset can be a dashboard, an ML model, a SQL query, a job, or just a link to something else. What DataZone offers is a way to catalog those different types of assets and build that business workflow where the producer knows who's subscribing to them.

And the producer can decide whether to give the end user access for their downstream activities. With that, I'll hand it over to Leo to show how it actually works, so you can see it working in the product.

Perfect. Thank you, Priya. OK, folks. Priya showed you what Amazon DataZone does; I'm going to show you how to do it. I'm going to show you a very quick demo of how to build an enterprise data catalog using Amazon DataZone. OK, perfect.

So the first thing we have to do is go to the catalog. Under the catalog option we have three main options: browse catalog, glossaries, and metadata forms. Let's start with glossaries. We're going to create a business glossary with some terms.

Here you can see the list of all the glossaries that we created for this demo. I have about 25, but in real life you can have a thousand. If we select one of them, you can see a general description of that business glossary, and you can also see the terms, each with its own description. If you want even more detail about a term, you have the option to add a README section on top of each term, and you can also create relationships between different terms.

Also, here you have the option to search for a specific term. Again, as I mentioned before, if you have many glossaries, you have the option to search. And here you can see the information for, for example, GPS data.

So now let's create a business glossary for this demo. Let's click create glossary. We need to start with the name, of course: certification status. Then we have to select the project that is going to own this glossary, and you have the option to add a description. It's always recommended to document everything that you put here. Perfect.

So we have the business glossary created. Let's add some terms. I'm going to add just two or three for this demo. Let's start with certified. Then we have to put in the description. Perfect. Now let's add two more. Create term again, and we put in pending certification and the description. Great. And now the last one, not certified, and also the description. Perfect.
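The same glossary could in principle be created from code rather than through the console. The sketch below assumes the boto3 DataZone client operations `create_glossary` and `create_glossary_term` with the parameter names shown; check them against the current boto3 documentation before relying on this. The helper name, the domain and project identifiers, the term names, and every description are placeholders, and the `DryRunClient` stand-in exists only so the example runs without AWS credentials.

```python
from __future__ import annotations


def create_glossary_with_terms(client, domain_id, project_id, name, description, terms):
    """Create a glossary, then one term per (name, short description) pair."""
    glossary = client.create_glossary(
        domainIdentifier=domain_id,
        owningProjectIdentifier=project_id,
        name=name,
        description=description,
        status="ENABLED",
    )
    term_ids = []
    for term_name, short_description in terms.items():
        term = client.create_glossary_term(
            domainIdentifier=domain_id,
            glossaryIdentifier=glossary["id"],
            name=term_name,
            shortDescription=short_description,
            status="ENABLED",
        )
        term_ids.append(term["id"])
    return glossary["id"], term_ids


class DryRunClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def _record(self, operation, kwargs):
        self.calls.append((operation, kwargs))
        return {"id": f"dry-run-{len(self.calls)}"}

    def create_glossary(self, **kwargs):
        return self._record("create_glossary", kwargs)

    def create_glossary_term(self, **kwargs):
        return self._record("create_glossary_term", kwargs)


# Placeholder term names and descriptions, not taken from the demo screen.
terms = {
    "Certified": "Placeholder description.",
    "Pending certification": "Placeholder description.",
    "Not certified": "Placeholder description.",
}
client = DryRunClient()  # with credentials: boto3.client("datazone")
glossary_id, term_ids = create_glossary_with_terms(
    client, "dzd_example", "project_example",
    "Certification status", "Placeholder description.", terms,
)
print(glossary_id, term_ids)
```

Passing the client in as an argument keeps the helper testable: the dry-run object and a real boto3 client are interchangeable as long as they expose the same two methods.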

Now, as you can see, you have your new business glossary with three terms. Here you also have the option to add the README section that I showed you before, and the relationships. Now let's go to the metadata forms.

So metadata forms allow you to add business metadata to your data assets. They are forms that you can customize in whatever way you want, with the information that you think is going to bring the most value to your consumers. For this demo we have three, but I'm going to create a new one.

We are going to call it data certification. I have to add a technical name here. Why? Because you can manipulate this metadata form using APIs, and in order to use the APIs, you need a technical name. Then a description, of course, and once again the project that is going to own this metadata form. OK, perfect.

The form is created, but it's empty, so let's add a couple of fields. Again, as I mentioned before, you can customize these in whatever way you want. Here I'm going to create one called certified by, for the person who is certifying the data. Again, I have to put in the technical name, the same behavior as for the metadata form. We have the option to add a description; this is optional, so I'm going to put a very short one. And here you have the option to select the field type.

As you can see, we have string and so on. I select string because it's a name. You can set the minimum and maximum length, and you can make the field searchable, required, or both. Searchable means that if somebody searches for this field as part of the data catalog, that person is going to be able to find the asset by this field. Required means that somebody needs to fill out this specific field in order to publish a data asset. OK, perfect.
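The field options described here (string type, minimum and maximum length, searchable, required) can be sketched as a small validation model. Everything below is a hypothetical illustration: `StringField`, `publish_errors`, `searchable_values`, the technical name `certified_by`, and the length limits are assumptions, not DataZone's form schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StringField:
    name: str  # technical name of the field
    required: bool = False
    searchable: bool = False
    min_length: int = 0
    max_length: int | None = None

    def validate(self, value: str | None) -> list[str]:
        """Return a list of problems; an empty list means the value is fine."""
        if value is None or value == "":
            return [f"{self.name} is required"] if self.required else []
        errors = []
        if len(value) < self.min_length:
            errors.append(f"{self.name} must be at least {self.min_length} characters")
        if self.max_length is not None and len(value) > self.max_length:
            errors.append(f"{self.name} must be at most {self.max_length} characters")
        return errors


def publish_errors(fields: list[StringField], values: dict[str, str]) -> list[str]:
    """Collect every validation problem that would block publishing."""
    errors = []
    for form_field in fields:
        errors.extend(form_field.validate(values.get(form_field.name)))
    return errors


def searchable_values(fields: list[StringField], values: dict[str, str]) -> dict[str, str]:
    """Keep only the filled-in values of fields marked searchable."""
    return {f.name: values[f.name] for f in fields if f.searchable and values.get(f.name)}


fields = [StringField("certified_by", required=True, searchable=True,
                      min_length=2, max_length=64)]

print(publish_errors(fields, {}))
print(publish_errors(fields, {"certified_by": "Maya Gos"}))
print(searchable_values(fields, {"certified_by": "Maya Gos"}))
```

In this sketch "required" gates publishing while "searchable" only controls what is indexed, so the two flags stay independent, matching the "searchable and or required" choice in the demo.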

So let's click create field. Nice. Now let's add a second field. This one is going to be slightly different, because it's going to be related to a business glossary. Metadata forms allow you to link the form with a business glossary, so that everybody inside your company uses the same terminology. That's why it's so important.

Here we're going to select the glossary that we recently created, in this case certification status. We have the option to allow the selection of multiple terms; I'm not going to do that. And again, searchable and required. OK, perfect. And remember, you need to enable the metadata form in order to use it. Nice.

So we have the glossary and we have the metadata form. Now let's document an asset. For that, I'm going to switch to a producer project. I'm going to select the data option here, then inventory, and here I have the list of my own published assets. I'm going to document traffic trials.

Here at the top you can see a green message. That means Amazon DataZone was able to generate some business metadata based on the technical metadata that is part of my data asset. OK, perfect.

Let me go to schema. Here you can see a green icon next to each column. That means DataZone found a better option to name that column. But these names are business names. They are not going to change your technical metadata; they only change the business metadata.

Once I agree with all the changes, I just have to accept them, and now those are going to be my business names. If we go back to the general business metadata, here you also have the option to create a README section on top of your asset. Let's add a very quick description here. Oh, sorry, before I do that, I wanted to show you that you can also document each column.

For that, you can select one of the columns, edit it, and add a business description if you want to. You can also add business glossary terms to each column. In this case, I'm going to select a verdict, and I'm going to save it. As you can see, the column is now linked to a business glossary term.

So let's go back to the business metadata. I'm going to add a README section. Something like that. Perfect, a very short description. I'm going to save it, and then I'm going to add a couple of business glossary terms.

As you see, you have three options for adding business glossary terms to an asset: at the column level, at the general level, and using metadata forms. Here you have the technical metadata; this comes automatically from harvesting the metadata from the Glue catalog. Now let's add a business-related metadata form. Let's use the very short one that we recently created.

We have to put in certified here, meaning that the asset is certified. And here I'm going to put the person who certified the asset. In this case, it was Maya Gos.

As you can see, I have all the information that I need at the column level and at the general level. And DataZone also helped me refine some of the metadata.

After I document everything, the only thing I have to do is click publish asset, and with just that, our data asset is going to be searchable for everyone who has access to the data catalog. OK? Yeah, 20 seconds. OK, perfect.

That's all for today. Thank you.

So yeah, thank you all. What I wanted to share is all the sessions that we have on data governance. It's a new track that was introduced at re:Invent this year, and DataZone is one of the key services in this track, so do check it out. We have sessions, workshops, and chalk talks where we can dive deeper into the different capabilities that DataZone offers.

We also recently introduced a masterclass, because data governance is about people, process, and tools. DataZone is the tool, but only if you have the people and the process in place will DataZone be successful for you.

For this reason, we announced this masterclass. It's a set of videos that talk about best practices for how to approach the people and the process, and help you understand how you can bring data governance into your organization. We also have a guide that can walk you through it step by step.

And with that, I'll wrap up this session. Thank you so much for taking the time.
