数据仓库架构:传统&云的对比

Data Warehouse Architecture: Traditional vs. Cloud

Data warehouse architecture is changing. Learn about traditional EDW vs. cloud-based architectures with lower upfront cost, improved scalability and performance.

数据仓库一直在变化。了解传统的企业级数据仓库和低成本、可伸缩和高性能基于云的数据仓库。

data warehouse is an electronic system that gathers data from a wide range of sources within a company and uses the data to support management decision-making.

数据仓库是一个从公司内各种数据源收集数据,用数据支持管理决策的信息系统。

Companies are increasingly moving towards cloud-based data warehouses instead of traditional on-premise systems. Cloud-based data warehouses differ from traditional warehouses in the following ways:

很多公司正不断的采用基于云的数据仓库来替代传统的本地系统。基于云的数据仓库和传统的数据仓库有下面的不同:

  • There is no need to purchase physical hardware.
  • It’s quicker and cheaper to set up and scale cloud data warehouses.
  • Cloud-based data warehouse architectures can typically perform complex analytical queries much faster because they use massively parallel processing (MPP).
  • 不需要购买硬件
  • 更快更便宜的搭建有伸缩性的云数据仓库
  • 基于云的数据仓库架构能更快的执行复杂的分析查询,因为它可以用大规模并行处理

The rest of this article covers traditional data warehouse architecture and introduces some architectural ideas and concepts used by the most popular cloud-based data warehouse services.

For more details, see our page about data warehouse concepts in this guide.

这篇文章的剩余部分将介绍传统数据仓库架构和一些被非常流行的基于云数据仓库服务使用的架构思想和概念

更多细节,参见 data warehouse concepts 

Traditional Data Warehouse Architecture

 

The following concepts highlight some of the established ideas and design principles used for building traditional data warehouses.

传统数据仓库架构

下面的概念强调构建传统数据仓库用到的一些思想和设计原则

Three-Tier Architecture

三层架构

Traditional data warehouse architecture employs a three-tier structure composed of the following tiers.

传统的数据仓库架构采用了以下三层架构

  • Bottom tier: This tier contains the database server used to extract data from many different sources, such as from transactional databases used for front-end applications.
  • 底层:这层包括了从多种不同数据源采集数据的数据库服务器,比如从用于前端应用的事务型数据库。
  • Middle tier: The middle tier houses an OLAP server, which transforms the data into a structure better suited for analysis and complex querying. The OLAP server can work in two ways: either as an extended relational database management system that maps the operations on multidimensional data to standard relational operations (Relational OLAP), or using a multidimensional OLAP model that directly implements the multidimensional data and operations.
  • 中间层:中间层是联机在线分析(OLAP)服务器。OLAP会把数据变换为更合适分析和复杂查询的结构化数据。OLAP服务器有两种工作模式。一种是可扩展的关系型数据库系统,把对多维数据的操作映射到标准的关系型操作(关系型OLAP)。另外一种是采用多维OLAP模型,直接实现多维数据和操作。
  • Top tier: The top tier is the client layer. This tier holds the tools used for high-level data analysis, querying reporting, and data mining.
  • 顶层:顶层是客户层。这层包含了用于高层数据分析,报表查询和数据挖掘的工具。

Kimball vs. Inmon

Two pioneers of data warehousing named Bill Inmon and Ralph Kimball had different approaches to data warehouse design.

两位数据仓库的先驱分别叫Bill Inmon 和Ralph Kimball 。他们用不同的方法来设计仓库。

Ralph Kimball’s approach stressed the importance of data marts, which are repositories of data belonging to particular lines of business. The data warehouse is simply a combination of different data marts that facilitates reporting and analysis. The Kimball data warehouse design uses a “bottom-up” approach.

Ralph Kimball的方法强调数据集市的重要性。数据集市是属于不同业务线的数据存储库。数据仓库仅仅是便于出报表和分析的数据集市的集合. Kimball  数据仓库设计用了“自下而上”的方法。

Bill Inmon regarded the data warehouse as the centralized repository for all enterprise data. In this approach, an organization first creates a normalized data warehouse model. Dimensional data marts are then created based on the warehouse model. This is known as a top-down approach to data warehousing.

Bill Inmon 认为数据仓库是企业数据的集中存储库。这种方法,一个组织先创建一个规范的数据仓库模型。 维度数据集是基于数据仓库模型的。 这就是“自上而下”的方法构建数据仓库。

Data Warehouse Models

数据仓库模型

In a traditional architecture there are three common data warehouse models: virtual warehouse, data mart, and enterprise data warehouse:

在一个传统架构里有三种常见的数据仓库模型:虚拟数据库,数据集市和企业数据仓库

  • A virtual data warehouse is a set of separate databases, which can be queried together, so a user can effectively access all the data as if it was stored in one data warehouse.
  • 一个虚拟数据仓库是一个分离的数据库集合。他们可以一起查询,所以用户可以高效的访问这些数据好像他们存在一个数据仓库。
  • A data mart model is used for business-line specific reporting and analysis. In this data warehouse model, data is aggregated from a range of source systems relevant to a specific business area, such as sales or finance.
  • 一个数据集市模型用于特殊业务线的报表和分析。在这种数据仓库模型里面,数据整合自很多和这个业务领域相关的源系统,比如销售和财务
  • An enterprise data warehouse model prescribes that the data warehouse contain aggregated data that spans the entire organization. This model sees the data warehouse as the heart of the enterprise’s information system, with integrated data from all business units.
  • 一个企业数据仓库模型规定数据仓库包含了涉及整个企业的聚合数据。这种模型认为数据仓库是整个企业信息系统的核心,有所有的业务单元的集成数据

Star Schema vs. Snowflake Schema

The star schema and snowflake schema are two ways to structure a data warehouse.

The star schema has a centralized data repository, stored in a fact table. The schema splits the fact table into a series of denormalized dimension tables. The fact table contains aggregated data to be used for reporting purposes while the dimension table describes the stored data.

Denormalized designs are less complex because the data is grouped. The fact table uses only one link to join to each dimension table. The star schema’s simpler design makes it much easier to write complex queries.

The snowflake schema is different because it normalizes the data. Normalization means efficiently organizing the data so that all data dependencies are defined, and each table contains minimal redundancies. Single dimension tables thus branch out into separate dimension tables.

The snowflake schema uses less disk space and better preserves data integrity. The main disadvantage is the complexity of queries required to access data—each query must dig deep to get to the relevant data because there are multiple joins.

ETL vs. ELT

ETL and ELT are two different methods of loading data into a warehouse.

ETL和ELT是加载数据到数据仓库的两种不同方法

Extract, Transform, Load (ETL) first extracts the data from a pool of data sources, which are typically transactional databases. The data is held in a temporary staging database. Transformation operations are then performed, to structure and convert the data into a suitable form for the target data warehouse system. The structured data is then loaded into the warehouse, ready for analysis.

抽取,转换,加载(ETL)首先从数据源(一般是事务型数据库)抽取数据。这些数据被临时放在一个暂存数据库。在此进行数据转换,把数据结构化并转换为目标数据仓库需要的形式。结构化数据被加载到数据仓库准备分析。

With Extract Load Transform (ELT), data is immediately loaded after being extracted from the source data pools. There is no staging database, meaning the data is immediately loaded into the single, centralized repository. The data is transformed inside the data warehouse system for use with business intelligence tools and analytics.

关于抽取加载转换(ELT),在从原数据池抽取以后,数据被直接加载到目标数据仓库。没有暂存数据库,意味着数据被立即加载到了一个集中的数据仓库。在数据仓库里,数据被转换为商业智能工具和分析用到的数据。

Organizational Maturity

组织的成熟度

The structure of an organization’s data warehouse also depends on its current situation and needs.

一个组织的数据仓库架构也取决于它的现状和需要。

The basic structure lets end users of the warehouse directly access summary data derived from source systems and perform analysis, reporting, and mining on that data. This structure is useful for when data sources derive from the same types of database systems.

最基本的架构是让数据仓库的终端用户可以直接访问源自各种数据源的汇总数据,并执行数据分析,出报表,挖掘数据。对那种数据源都来自相同类型的数据库系统,这种架构很有用。

A warehouse with a staging area is the next logical step in an organization with disparate data sources with many different types and formats of data. The staging area converts the data into a summarized structured format that is easier to query with analysis and reporting tools.

对于有很多不同类型和不同格式数据的数据源的组织, 暂存区是一个数据仓库的逻辑步骤。

A variation on the staging structure is the addition of data marts to the data warehouse. The data marts store summarized data for a particular line of business, making that data easily accessible for specific forms of analysis. For example, adding data marts can allow a financial analyst to more easily perform detailed queries on sales data, to make predictions about customer behavior. Data marts make analysis easier by tailoring data specifically to meet the needs of the end user.

对暂存区架构的数据仓库一个变化是添加数据集市。数据集市存储了不同行业的汇总数据,使特定分析更容易访问相关数据。比如,添加数据集市可以让财务分析更容易实现对销售数据的详细查询,以此来预测客户的行为。通过专门制作符合用户需求的数据,数据集市让分析更加的容易。

New Data Warehouse Architectures

新数据仓库架构

In recent years, data warehouses are moving to the cloud. The new cloud-based data warehouses do not adhere to the traditional architecture; each data warehouse offering has a unique architecture.

This section summarizes the architectures used by two of the most popular cloud-based warehouses: Amazon Redshift and Google BigQuery.

近年来,数据仓库正移向云端。新的数据仓库不需要符合传统的架构,每个数据仓库都有独特的架构。

这部分总结了两个十分流行的基于云的数据仓库架构。亚马逊Redshift 和 谷歌的BigQuery

Amazon Redshift

Amazon Redshift is a cloud-based representation of a traditional data warehouse.

Amazon Redshift 是一个基于云的传统数据仓库的代表

Redshift requires computing resources to be provisioned and set up in the form of clusters, which contain a collection of one or more nodes. Each node has its own CPU, storage, and RAM. A leader node compiles queries and transfers them to compute nodes, which execute the queries.

Redshift 需要提供计算资源来建立集群。集群包含一个或多个节点,每个节点包含它自己的CPU,存储和内存。一个主节点编译查询,转移他们到执行查询的计算节点上。

On each node, data is stored in chunks, called slices. Redshift uses a columnar storage, meaning each block of data contains values from a single column across a number of rows, instead of a single row with values from multiple columns.

在每个节点上,数据用块存储,叫切片。Redshift 用列式存储。意味着每个数据块包含这多行数据的一列,而不是一个数据的多列。

Source: AWS Documentation

Redshift uses an MPP architecture, breaking up large data sets into chunks which are assigned to slices within each node. Queries perform faster because the compute nodes process queries in each slice simultaneously. The Leader Node aggregates the results and returns them to the client application.

Redshift 用了MPP架构,把大的数据集分解成块,块被分配给每个节点内的切片。查询执行块是因为计算节点在每个分片并发的处理查询。主节点聚集这些结果,返回给客户端程序。

Client applications, such as BI and analytics tools, can directly connect to Redshift using open source PostgreSQL JDBC and ODBC drivers. Analysts can thus perform their tasks directly on the Redshift data.

客户端应用,比如Bi 和分析工具,可以用PostgreSQL JDBC 和ODBC驱动直接连接Redshift 。分析师可以执行他们的任务在Redshift 的数据。

Redshift can load only structured data. It is possible to load data to Redshift using pre-integrated systems including Amazon S3 and DynamoDB, by pushing data from any on-premise host with SSH connectivity, or by integrating other data sources using the Redshift API.

Redshift 只能加载结构化数据。它可以用预集成系统(包括S3 和DynamoDB)加载数据到Redshift 。用SSH连接从任何本地主机推送数据,用Redshift API来集成其它数据源。

Google BigQuery

BigQuery’s architecture is serverless, meaning Google dynamically manages the allocation of machine resources. All resource management decisions are, therefore, hidden from the user.

BigQuery架构是无服务状态的,这意味着Google 动态的管理所有机器资源的分配。所有资源管理的决定对用户来说是隐藏的

BigQuery lets clients load data from Google Cloud Storage and other readable data sources. The alternative option is to stream data, which allows developers to add data to the data warehouse in real-time, row-by-row, as it becomes available.

BigQuery 可以从Google 云存储或者其它可以读的数据源加载数据。另一个方法是流数据,让开发者实时的一行一行地添加数据到它的数据仓库。

BigQuery uses a query execution engine named Dremel, which can scan billions of rows of data in just a few seconds. Dremel uses massively parallel querying to scan data in the underlying Colossus file management system. Colossus distributes files into chunks of 64 megabytes among many computing resources named nodes, which are grouped into clusters.

BigQuery 用的查询引擎叫Dremel。Dremel可以扫描上亿条数据只要几秒钟。Dremel用大规模并发查询来扫描基于Colossus 文件管理系统上的数据。Colossus 把文件划分为64M的块,分配到很多叫Node(节点)的计算资源。这些nodes 组成了clusters(集群)

Dremel uses a columnar data structure, similar to Redshift. A tree architecture dispatches queries among thousands of machines in seconds.

类似于Redshift,Dremel 采用列数据结构。一个树架构在数秒间分配查询到数千台机器

Image source

Simple SQL commands are used to perform queries on data.

简单的SQL命令用来执行数据查询。

Panoply

Panoply provides end-to-end data management-as-a-service. Its unique self-optimizing architecture utilizes machine learning and natural language processing (NLP) to model and streamline the data journey from source to analysis, reducing the time from data to value as close as possible to none.

Panoply’s smart data infrastructure includes the following features:

  • Analyzing of queries and data – identifying the best configuration for each use case, adjusting it over time, and building indexes, sortkeys, diskeys, data types, vacuuming, and partitioning.
  • Identifying queries that do not follow best practices – such as those that include nested loops or implicit casting – and rewrites them to an equivalent query requiring a fraction of the runtime or resources.
  • Optimizing server configurations over time based on query patterns and by learning which server setup works best. The platform switches server types seamlessly and measures the resulting performance.

Beyond Cloud Data Warehouses

Cloud-based data warehouses are a big step forward from traditional architectures. However, users still face several challenges when setting them up:

  • Loading data to cloud data warehouses is non-trivial, and for large-scale data pipelines, it requires setting up, testing, and maintaining an ETL process. This part of the process is typically done with third-party tools.
  • Updates, upserts, and deletions can be tricky and must be done carefully to prevent degradation in query performance.
  • Semi-structured data is difficult to deal with - needs to be normalized into a relational database format, which requires automation for large data streams.
  • Nested structures are typically not supported in cloud data warehouses. You will need to flatten nested tables into a format the data warehouse can understand.
  • Optimizing your cluster—there are different options for setting up a Redshift cluster to run your workloads. Different workloads, data sets, or even different types of queries might require a different setup. To stay optimal you’ll need to continually revisit and tweak your setup.
  • Query optimization—user queries may not follow best practices, and consequently will take much longer to run. You may find yourselves working with users or automated client applications to optimize queries so that the data warehouse can perform as expected.
  • Backup and recovery—while the data warehouse vendors provide numerous options for backing up your data, they are not trivial to set up and require monitoring and close attention.

Panoply is a Smart Data Warehouse that adds a layer of automation that takes care of all of the complex tasks above, saving valuable time and helping you get from data to insight in minutes.

Learn more about Panoply’s smart data warehouse tools.

  • 1
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
随着数据量的不断增加,数据仓库的作用越来越重要。在数据仓库架构中,主流的架构包括了单层架构、两层架构、三层架构和多层架构。下面我将对这几种架构进行分析比较。 1. 单层架构 单层架构是最简单的数据仓库架构,它的结构相对较为简单,数据直接从源系统导入到数据仓库中。由于单层架构没有中间层,所以它具有较高的性能和较低的复杂度。但是单层架构的缺点也很明显,因为数据直接从源系统导入,所以数据的清洗和整合都需要在数据仓库中进行,导致数据处理的难度加大。 2. 两层架构 两层架构相比单层架构,增加了中间层。在两层架构中,数据从源系统导入到中间层,经过数据清洗、转换和整合后,再导入到数据仓库中。两层架构的中间层可以将不同数据源的数据进行整合,使数据更加准确和一致。另外,中间层可以提高数据处理的效率,减少对源系统的影响。但是两层架构的缺点也很明显,因为中间层的存在,增加了架构的复杂度和维护成本。 3. 三层架构 三层架构是在两层架构的基础上增加了数据存储层。在三层架构中,数据从源系统导入到中间层,经过数据清洗、转换和整合后,再导入到数据存储层中。数据存储层主要是用来存储数据仓库的数据,包括原始数据、中间结果和汇总结果等。通过将数据存储层与中间层分离,可以实现数据的分层管理和优化存储。三层架构的优点是更加灵活和可扩展,但是也需要更多的维护成本。 4. 多层架构 多层架构是在三层架构的基础上增加了数据访问层和应用层。在多层架构中,数据从源系统导入到中间层,经过数据清洗、转换和整合后,再导入到数据存储层中。数据访问层主要是用来提供数据访问的接口,而应用层则是用来支持不同的应用需求,包括报表

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值