The Definitive Intro to Data Vault

DATA ENGINEERING

With the exponential growth in data volume and a substantial increase in data variety, businesses face the problem of architecting and maintaining data warehouses and data marts that can keep pace with software development in agile teams. The big data explosion of the last decade has forced us to rethink how we store, retrieve and analyse data.

The same reasons that led us to shift from relational models to dimensional models have led us to Data Vault, with one distinction: Data Vault is not necessarily a replacement for, or an upgrade over, dimensional modelling for OLAP. It is a better approach for some use cases and not for others. The same can be said of dimensional modelling.

An Alternative to Dimensional Modelling

The need for a new modelling technique arose from the ever-changing nature of software applications and the needs of data teams. With traditional data warehouse modelling techniques, making changes very frequently gets costly. Data Vault, on the other hand, was designed to solve exactly this problem.

Agile software teams need agile data teams to complement them.

Traditional data warehouse modelling techniques like the star schema are still very relevant and useful. Data Vault simply attempts to combine ideas from both dimensional modelling with a star schema and relational modelling in third normal form (3NF).

Just as important as supporting fast-paced application development is seamless scalability. Cloud computing has made deploying and scaling data warehouses easier, but with an inefficient model and excessive maintenance overhead, a cloud data warehouse can still suffer badly. If the model is bad, it doesn't matter how many resources you allocate to the warehouse: queries will perform poorly.

A case study at Daimler: moving from a star schema to Data Vault.

What Makes a Data Vault

The creator of Data Vault, Dan Linstedt, says the following about his approach to modelling:

The Data Vault is a detail-oriented, history-tracking and uniquely linked set of normalized tables that support one or more functional areas of business.

With these principles in mind, let's look at the two types of vaults that exist in the Data Vault domain: the Raw Vault and the Business Vault. The Raw Vault is the first layer, where data from heterogeneous sources is loaded into the Data Vault. This data is unfiltered. The Business Vault, on the other hand, is a (sometimes optional) extension of the Raw Vault. It has business logic applied to its tables, such as case-when statements, multi-column calculations, and splitting or joining columns. The purpose of the Business Vault is to make data more accessible and more understandable for the business user.
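As a rough illustration of that layering, the sketch below derives a Business Vault view from a Raw Vault table using Python's built-in `sqlite3` module. The table name `raw_sat_order`, the view name `bv_sat_order`, all column names and the status codes are hypothetical, chosen only for this example; a real Business Vault may well be materialized differently.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Raw Vault: data loaded as-is from the source, unfiltered (hypothetical table).
CREATE TABLE raw_sat_order (
    order_no    TEXT,
    load_date   TEXT,
    first_name  TEXT,
    last_name   TEXT,
    quantity    INTEGER,
    unit_price  REAL,
    status_code TEXT
);

-- Business Vault: business logic applied on top of the raw table.
CREATE VIEW bv_sat_order AS
SELECT
    order_no,
    load_date,
    first_name || ' ' || last_name AS customer_name,  -- joining columns
    quantity * unit_price          AS order_total,    -- multi-column calculation
    CASE status_code                                  -- case-when statement
        WHEN 'S' THEN 'shipped'
        WHEN 'C' THEN 'cancelled'
        ELSE 'unknown'
    END AS status
FROM raw_sat_order;
""")

conn.execute(
    "INSERT INTO raw_sat_order VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("O-1", "2020-01-05", "Ada", "Lovelace", 3, 2.5, "S"),
)

rows = conn.execute(
    "SELECT customer_name, order_total, status FROM bv_sat_order"
).fetchall()
print(rows)
```

Using a view keeps the raw table untouched, so the business rules can change without reloading the source data. That is one possible design choice, not a requirement of the approach.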

There are five types of tables in the Data Vault model. The original Data Vault spec had only three of them; two more were introduced with Data Vault 2.0:

  • Hubs: contain just the business keys (natural keys) used by the business. Business keys should not be surrogates (as in some dimensional models); they should have meaning for the business.

  • Links: contain the unique relationships between business keys. A link is essentially a mapping table with no real data other than the mapping between different hubs (using business keys).

  • Satellites: contain all the non-key, descriptive data, both current and historical. Ideally, both hub and link tables should have several satellite tables.

  • PIT (point-in-time) tables: an extension of the satellite tables, especially useful when a hub used in queries has more than one satellite table to join.

  • Bridge tables: just as a PIT table is used to extend satellites, a bridge table is used to extend links, helping with joins that span several hubs and links.
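To make the first three table types more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Every table and column name (`hub_customer`, `link_purchase`, `sat_customer_details`, `load_date`, `record_source` and so on) is an assumption made for illustration, as is the helper `customer_as_of`; none of them comes from the Data Vault specification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Hubs: just the business keys (plus assumed load metadata).
CREATE TABLE hub_customer (
    customer_no   TEXT PRIMARY KEY,
    load_date     TEXT,
    record_source TEXT
);
CREATE TABLE hub_product (
    product_code  TEXT PRIMARY KEY,
    load_date     TEXT,
    record_source TEXT
);

-- Link: a unique relationship between business keys, no descriptive data.
CREATE TABLE link_purchase (
    customer_no  TEXT REFERENCES hub_customer(customer_no),
    product_code TEXT REFERENCES hub_product(product_code),
    load_date    TEXT,
    PRIMARY KEY (customer_no, product_code)
);

-- Satellite: descriptive attributes; one row per change keeps the history.
CREATE TABLE sat_customer_details (
    customer_no TEXT REFERENCES hub_customer(customer_no),
    load_date   TEXT,
    name        TEXT,
    city        TEXT,
    PRIMARY KEY (customer_no, load_date)
);
""")

conn.execute("INSERT INTO hub_customer VALUES ('C-100', '2020-01-01', 'crm')")
conn.execute("INSERT INTO hub_product VALUES ('P-7', '2020-01-01', 'erp')")
conn.execute("INSERT INTO link_purchase VALUES ('C-100', 'P-7', '2020-01-02')")
conn.executemany(
    "INSERT INTO sat_customer_details VALUES (?, ?, ?, ?)",
    [
        ("C-100", "2020-01-01", "Ada", "Berlin"),  # historical row
        ("C-100", "2020-03-01", "Ada", "Munich"),  # current row
    ],
)


def customer_as_of(conn, customer_no, as_of):
    """Return (name, city) from the latest satellite row loaded on or before as_of.

    Dates are ISO-8601 strings, so string comparison matches date order.
    """
    return conn.execute(
        """
        SELECT name, city
        FROM sat_customer_details
        WHERE customer_no = ? AND load_date <= ?
        ORDER BY load_date DESC
        LIMIT 1
        """,
        (customer_no, as_of),
    ).fetchone()


print(customer_as_of(conn, "C-100", "2020-02-15"))
print(customer_as_of(conn, "C-100", "2020-06-01"))
```

The as-of lookup has to be repeated for every satellite attached to a hub. Roughly speaking, that repeated work is what a PIT table is meant to reduce, by storing which satellite rows apply at each point in time.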

To summarize, the Data Vault model is an alternative to the more traditional approaches to data warehouse modelling because it aims to:

  • keep up with the variety and quantity of data coming in from the source systems, i.e., scalability

  • keep up with the frequent changes that are required because of agile software development practices

  • give the business user a contextual view of the data, in which they can easily explore and analyse data generated from various sources


For Further Exploration

Data Vault modelling in a Complex Data Environment

Translated from: https://towardsdatascience.com/the-definitive-intro-to-data-vault-8d43eaad1c38
