Data Governance - Process - 1 - Metadata Management - Concepts

Xu Kun

Category: Data Governance

Original article: https://blog.csdn.net/jiangzhenbo/article/details/85255240

I. What exactly is metadata?
II. Where does metadata come from?
III. What can we do with metadata?
IV. Data Catalog for Digital Transformation
V. Metadata management tools

Reposted from https://blog.csdn.net/jiangzhenbo/article/details/85255240

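I. What exactly is metadata?

1. Metadata is data that describes data

That is the standard definition of metadata, but it is rather abstract. To describe a concrete piece of data clearly, take a table as an example: we need to know its name, its alias, its owner, the physical location where its data is stored, its primary key, its indexes, the fields it contains, its relationships to other tables, and so on. All of this information together is the metadata of that table.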
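To make the table example concrete, here is a minimal sketch that reads a table's metadata from a database's own catalog. It assumes SQLite (through Python's standard `sqlite3` module) purely because it needs no setup; the `customer` and `orders` tables and the `describe_table` helper are hypothetical names chosen for illustration, and other databases expose the same information through their own system catalogs.

```python
import sqlite3

def describe_table(conn: sqlite3.Connection, table: str) -> dict:
    """Collect basic metadata for one table from SQLite's catalog.

    `table` is assumed to be a trusted identifier (it is interpolated into SQL).
    """
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    columns = [
        {"name": row[1], "type": row[2], "not_null": bool(row[3]), "pk": bool(row[5])}
        for row in conn.execute(f"PRAGMA table_info('{table}')")
    ]
    # PRAGMA index_list rows: (seq, name, unique, origin, partial)
    indexes = [row[1] for row in conn.execute(f"PRAGMA index_list('{table}')")]
    # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
    foreign_keys = [
        {"column": row[3], "ref_table": row[2], "ref_column": row[4]}
        for row in conn.execute(f"PRAGMA foreign_key_list('{table}')")
    ]
    return {
        "table": table,
        "columns": columns,
        "primary_key": [c["name"] for c in columns if c["pk"]],
        "indexes": indexes,
        "foreign_keys": foreign_keys,
    }

# Hypothetical schema used only to demonstrate the helper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    amount REAL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
""")
meta = describe_table(conn, "orders")
print(meta)
```

Attributes such as the table owner or the physical storage location are not available in SQLite; in a server database they would come from the system catalog or from manually maintained records.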

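2. Metadata management is the core and foundation of data governance

Why is metadata management the core and foundation of data governance? Why should data governance start with metadata management? What makes its position so special? Imagine a general going to war: what information must he have? A map of the battlefield. It is hard to believe that a general without a military map could win a battle. Metadata is the map of all of an organization's data. From this map of the data we can learn:

What data do we have?
Where is the data located?
What type is each piece of data?
How are the pieces of data related to one another?
Which data is referenced frequently, and which is never accessed?

All of this information can be found in the metadata. Doing data governance without this map is like the blind men feeling the elephant. Data asset management and knowledge graphs, which later articles will cover, are also largely built on metadata. That is why we say: metadata is an organization's data map, and it is the core and foundation of data governance.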

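3. Metadata is data that describes data; is there data that describes metadata?
Yes. The data that describes metadata is called a metamodel (Meta Model). The relationship among the metamodel, metadata, and data is shown in the figure below.
(Figure: relationship among the metamodel, metadata, and data)
We will not discuss the metamodel concept in depth. It is enough to know the following:
The structure of metadata itself also needs to be defined and standardized, and what defines and standardizes it is the metamodel. The international standard for metamodels is CWM (Common Warehouse Metamodel), and a mature metadata management tool needs to support the CWM standard.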
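The three layers can be illustrated with a toy sketch. This is not CWM and makes no attempt to follow it; the `METAMODEL` dictionary, the `conforms` helper, and the `orders` example are all hypothetical, and they only show the idea that a metamodel constrains the shape of metadata in the same way that metadata describes the shape of data.

```python
# Metamodel layer: declares which attributes each kind of metadata record has.
METAMODEL = {
    "Table": {"name": str, "owner": str, "columns": list},
    "Column": {"name": str, "data_type": str},
}

def conforms(kind: str, record: dict) -> bool:
    """Check a metadata record against the metamodel entry for its kind."""
    spec = METAMODEL[kind]
    # The record must have exactly the declared attributes, with the declared types.
    return set(record) == set(spec) and all(
        isinstance(record[attr], expected) for attr, expected in spec.items()
    )

# Metadata layer: a record describing one table.
table_meta = {"name": "orders", "owner": "sales_team", "columns": ["id", "amount"]}

# Data layer: the rows that the metadata describes.
rows = [{"id": 1, "amount": 9.5}]

print(conforms("Table", table_meta))          # record matches the metamodel
print(conforms("Table", {"name": "orders"}))  # record is missing attributes
```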

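II. Where does metadata come from?

On a big data platform, metadata runs through the entire flow of data across the platform. It mainly includes data source metadata, data processing metadata, metadata for subject and thematic databases, service layer metadata, and application layer metadata. The figure below uses a data center as an example to show where metadata is distributed:
(Figure: distribution of metadata across a data center)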
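The industry usually divides metadata into the following types:

Technical metadata: database and table structures, field constraints, data models, database details, and so on.
Operational metadata: ETL programs (data processing, scheduling, exception handling).
Business metadata: business metrics, business codes, business terms, and so on.
Management metadata: data owners, data quality accountability, data security levels, and so on.

Metadata collection is the process of obtaining metadata across the data lifecycle, organizing it, and writing it to a database. Obtaining metadata takes several methods: direct database connections, interfaces, log files, and other techniques are used to collect, both automatically and manually, the data dictionaries of structured data, metadata about unstructured data, business metrics, codes, data processing flows, and other metadata. Once collected, the metadata is organized into a structure that conforms to the CWM model and stored in a relational database.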
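The "direct database connection" collection path described above can be sketched as follows. This is a simplified illustration under stated assumptions: both the source system and the metadata repository are SQLite databases, the repository tables `meta_table` and `meta_column` are a hypothetical layout (not a CWM-conformant one), and only technical metadata is harvested.

```python
import sqlite3

def harvest(source: sqlite3.Connection, repo: sqlite3.Connection, source_name: str) -> int:
    """Copy table and column metadata from `source` into the repository."""
    repo.executescript("""
        CREATE TABLE IF NOT EXISTS meta_table (
            source TEXT, table_name TEXT, PRIMARY KEY (source, table_name));
        CREATE TABLE IF NOT EXISTS meta_column (
            source TEXT, table_name TEXT, column_name TEXT,
            data_type TEXT, position INTEGER);
    """)
    # sqlite_master is SQLite's data dictionary; skip its internal tables.
    tables = [row[0] for row in source.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    for table in tables:
        repo.execute("INSERT OR REPLACE INTO meta_table VALUES (?, ?)",
                     (source_name, table))
        # Re-harvesting replaces the previous column snapshot for this table.
        repo.execute("DELETE FROM meta_column WHERE source = ? AND table_name = ?",
                     (source_name, table))
        for cid, name, data_type, *_ in source.execute(f"PRAGMA table_info('{table}')"):
            repo.execute("INSERT INTO meta_column VALUES (?, ?, ?, ?, ?)",
                         (source_name, table, name, data_type, cid))
    repo.commit()
    return len(tables)

# Hypothetical source system with two tables.
source_db = sqlite3.connect(":memory:")
source_db.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
""")
repo_db = sqlite3.connect(":memory:")
harvested = harvest(source_db, repo_db, "crm")
column_count = repo_db.execute("SELECT COUNT(*) FROM meta_column").fetchone()[0]
print(harvested, column_count)
```

Business and management metadata (metrics, owners, security levels) usually cannot be read from a system catalog; they would be entered manually or imported through other interfaces into the same repository.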

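III. What can we do with metadata?

First look at an overall functional architecture diagram of metadata management; it shows at a glance what metadata makes possible:
(Figure: overall functional architecture of metadata management)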

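Metadata browsing

Metadata is usually organized in a tree structure and can be browsed and searched by type. Users can browse table structures, field information, data models, metric information, and so on. With sensible permission assignment, metadata browsing can greatly improve information sharing within the organization.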

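Data lineage and impact analysis

Data lineage and impact analysis mainly answer the question "how is the data related?" Because of their value, some vendors extract them from metadata management and offer them as a separate major feature. Since both are derived from metadata, however, this article describes them under metadata management. Lineage analysis obtains the lineage of data, recording where the data came from and how it was processed as historical fact.
Lineage analysis is valuable to users: when problem data turns up during analysis, they can follow the lineage back to its root, quickly locate the source and processing flow of the problem data, and reduce the time and difficulty of the investigation.

A typical lineage analysis scenario: a business user finds a data quality problem in the "Monthly Marketing Analysis" report and raises it with the IT department. Using metadata lineage analysis, the technical staff find that the report is fed by four different upstream tables in the FDM layer, which lets them quickly locate the source of the problem and resolve it at low cost.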
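The report scenario above amounts to walking a dependency graph upstream. The sketch below assumes lineage has already been collected as a simple mapping from each dataset to the datasets it is built from; the FDM and ODS table names and the `trace_upstream` helper are invented for illustration, since the article does not name the four tables.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets it is built from.
UPSTREAM = {
    "report.monthly_marketing_analysis": [
        "fdm.sales", "fdm.campaign", "fdm.customer", "fdm.channel",
    ],
    "fdm.sales": ["ods.orders"],
    "fdm.customer": ["ods.crm_customer"],
}

def trace_upstream(dataset: str, upstream: dict) -> list:
    """Return every direct and indirect source of `dataset`, breadth first."""
    seen = []
    queue = deque(upstream.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited through another path
        seen.append(node)
        queue.extend(upstream.get(node, []))
    return seen

sources = trace_upstream("report.monthly_marketing_analysis", UPSTREAM)
for name in sources:
    print(name)
```

Breadth-first order lists the nearest sources first, which matches how an investigation usually proceeds: check the four FDM tables, then go further back only if needed.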

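Besides lineage analysis there is impact analysis, which identifies where data flows downstream. When a system is upgraded and metadata such as data structures or ETL programs is modified, impact analysis quickly shows which downstream systems the change will affect, reducing the risk of the upgrade. In other words, impact analysis is the mirror image of lineage analysis: lineage analysis points to the upstream sources of data, and impact analysis points downstream.

A typical impact analysis scenario: because of a business system upgrade, an organization changes a field in the "FINAL_ZENT" table, increasing the length of TRADE_ACCORD from 8 to 64, and needs to analyze how the upgrade affects related systems downstream. Running impact analysis on the "FINAL_ZENT" metadata shows that related tables and ETL programs in the downstream DW layer are all affected. Having located the impact, the IT department updates the downstream programs and table structures in time and avoids the problem. Impact analysis thus helps pin down the effects of a metadata change quickly and eliminate potential problems before they surface.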
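The FINAL_ZENT scenario can be sketched as a downstream walk that groups the affected objects by kind, so that tables and ETL programs can be handed to the right owners. Only `FINAL_ZENT` comes from the text; the ETL job names, the DW table names, and the `impacted` helper are hypothetical.

```python
# Hypothetical dependency edges: (source, consumer, kind of consumer).
EDGES = [
    ("FINAL_ZENT", "etl_load_dw_trade", "etl"),
    ("etl_load_dw_trade", "DW_TRADE", "table"),
    ("DW_TRADE", "etl_build_trade_summary", "etl"),
    ("etl_build_trade_summary", "DW_TRADE_SUMMARY", "table"),
    ("OTHER_SOURCE", "DW_OTHER", "table"),
]

def impacted(changed: str, edges: list) -> dict:
    """Group everything reachable downstream of `changed` by kind."""
    result = {}
    visited = {changed}
    stack = [changed]
    while stack:
        node = stack.pop()
        for source, consumer, kind in edges:
            if source == node and consumer not in visited:
                visited.add(consumer)
                result.setdefault(kind, []).append(consumer)
                stack.append(consumer)
    return result

impact = impacted("FINAL_ZENT", EDGES)
print(impact)
```

Objects that do not depend on the changed table, such as `DW_OTHER` here, are left out of the result, which is what keeps the review of a schema change focused.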

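Data hot/cold analysis

Hot/cold analysis mainly gathers statistics on how data tables are used, such as the relationships between tables and ETL programs, between tables and analytical applications, and between tables and other tables. It analyzes how hot or cold data is from the perspectives of access frequency and business demand, and presents an importance index for each table in charts.

Hot/cold analysis is very valuable to users. A typical scenario: some data resources are observed to sit idle for a long time, called by no application and used by no other program. Users can then consult the hot/cold report and, combined with manual analysis, store data in tiers according to how hot it is to make better use of HDFS resources, or evaluate whether to retire the data that has lost its value to save storage space.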
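A minimal sketch of the access-frequency side of this analysis follows. It assumes an access log of (table, date) events is already available; the 90-day window, the hot threshold of three accesses, the three labels, and the table names are all illustrative assumptions rather than values from the article.

```python
from collections import Counter
from datetime import date, timedelta

def classify_tables(tables, access_log, today, window_days=90, hot_threshold=3):
    """Label each table hot, warm, or cold from recent access counts."""
    cutoff = today - timedelta(days=window_days)
    # Count only accesses that fall inside the observation window.
    counts = Counter(table for table, day in access_log if day >= cutoff)
    labels = {}
    for table in tables:
        n = counts[table]
        if n >= hot_threshold:
            labels[table] = "hot"
        elif n > 0:
            labels[table] = "warm"
        else:
            labels[table] = "cold"  # candidate for cheaper storage or retirement
    return labels

# Hypothetical access events: (table name, access date).
log = [
    ("dw_trade", date(2024, 1, 30)),
    ("dw_trade", date(2024, 1, 29)),
    ("dw_trade", date(2024, 1, 10)),
    ("dw_customer", date(2024, 1, 5)),
    ("dw_legacy", date(2023, 6, 1)),  # older than the window
]
labels = classify_tables(["dw_trade", "dw_customer", "dw_legacy"], log, date(2024, 1, 31))
print(labels)
```

The labels are only an input to the manual review the text describes; a cold table may still be required for regulatory or archival reasons.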

IV. Data Catalog for Digital Transformation

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4356-2020.pdf

1. Introduction

Companies are starting their digital transformation to add value to their data and to build a data-driven strategy. Unfortunately, most organizations govern their data in an ad hoc or firefighting manner across different parts of the business, and most of the time only within IT. Mapping data by building a data catalog is one of the first steps toward more governance and sustainability.

Gartner gives the following definition: “A data catalog maintains an inventory of data assets through the discovery, description, and organization of datasets. The catalog provides context to enable data analysts, data scientists, data stewards, and other data consumers to find and understand a relevant dataset for the purpose of extracting business value.”
But Gartner’s definition does not really differ from historical metadata management, as it does not focus on what makes data catalogs so trendy today: automation and collaboration. Excel-based or IT-driven data dictionaries are over; the amount of data is too large and requires automation to scale. Data consumers want to access data and to enrich, comment on, and challenge the use and the quality of data.

Let’s dare to give a definition: “A data catalog is an automated collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need. It also serves as an inventory of available data and provides information to evaluate the fitness of data for intended uses.” In a few words, a data catalog is your organization's metadata social network!

2. Data Catalog Objectives and Benefits

Objectives

allow data citizens to find the data they need in an efficient way
empower organizations to quickly inventory, discover, manage, and understand all their data
move from tribal to centralized and crowdsourced knowledge
ingest new data sets and enable the use of new data faster
become the foundational layer for driving data governance, quality, and information security policies
foster collaboration between business and IT to contribute to the shared understanding of the information

Benefits

Data catalogs contribute to increasing efficiency, as they allow analysts to shorten the time they need to qualify the correct data.
They also support data governance and risk mitigation by identifying personal and sensitive data, and by allowing you to establish and spread best practices in terms of data management and data quality.
Finally, data management is simplified: new data sources can be onboarded more quickly, key assets can be easily identified and monitored, and redundant and untapped data can be detected and remediated. In the end, the data ecosystem becomes more rational and more agile.

3. Data Catalog Features

Most of the existing solutions rely on the following four main components:

A flexible data model for storing the metadata objects and their relationships
A set of data discovery services that allow you to extract metadata from structured and unstructured data sources and to enrich (discover, score) metadata with additional information and insight
Search and indexing services that allow you to make the information available as quickly as possible and to formulate complex search queries
An intuitive, easy-to-use, and collaborative user interface so that any kind of user can search for and find what they need
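The search and indexing component listed above can be illustrated with a small inverted index over catalog entries. This is a sketch, not how any particular product works: the `CATALOG` entries, their tags, and the helper names are hypothetical, and a real catalog would add ranking, synonyms, and access control.

```python
import re
from collections import defaultdict

# Hypothetical catalog entries: dataset name -> descriptive metadata.
CATALOG = {
    "dw.customer": {"description": "Customer master data", "tags": ["pii", "crm"]},
    "dw.orders": {"description": "Customer orders by day", "tags": ["sales"]},
    "dw.campaign": {"description": "Marketing campaign results", "tags": ["marketing"]},
}

def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(catalog: dict) -> dict:
    """Map each token to the set of datasets whose metadata mentions it."""
    index = defaultdict(set)
    for name, entry in catalog.items():
        text = " ".join([name, entry["description"], *entry["tags"]])
        for token in tokenize(text):
            index[token].add(name)
    return index

def search(index: dict, query: str) -> list:
    """Return datasets that match every token in the query, sorted by name."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)

index = build_index(CATALOG)
print(search(index, "customer"))
print(search(index, "customer pii"))
```

Indexing names, descriptions, and tags together is what lets a business user find a dataset by a term such as "pii" without knowing its technical name.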