The Microsoft Data Warehouse Toolkit, Reading Notes (6): Chapter 5, ETL System Design

Chapter 5: ETL System Design

I. Overview

· ETL design specifications

o High-level map

o Detailed data profiling report

o Source-to-target mapping

o Strategy document for extracting data

o Strategy document for updating the Analysis Services database

· ETL system physical design and architecture

· Introduction to SSIS

 

II. Introduction to SSIS

1. Overview

Your ETL team will develop one or more Integration Services packages to populate your DW/BI system. A package is analogous to a computer program or a script. When your DW/BI system is in production, you can execute a package to perform a database maintenance task, to load a table, or to populate an entire business process dimensional model.

An Integration Services package contains one or more tasks.

2. Control Flow

Commonly used predefined tasks include:

· Bulk Insert task

Similar to the BULK INSERT and bcp commands, this task is suited to loading clean data from a file. If the data in the file needs to be transformed, use a Data Flow task instead (see the T-SQL sketch after this list).

· Execute SQL task

Executes a SQL statement against a database (not limited to SQL Server). A common scenario is querying data or metadata and placing the result in a variable (see the sketch after this list).

· File Transfer Protocol and File System tasks

Transfer files and execute file system commands.

· Execute Package, Execute DTS 2000 Package, and Execute Process tasks

The Execute Package task lets you break a large process into many smaller SSIS packages and define a master package that executes them.

The Execute Process task runs operating system commands.

· Send Mail task

· Script and ActiveX Script tasks

The Script task uses Visual Basic .NET from the Visual Studio for Applications environment. Or, you can use any .NET language to create a custom task that will become available in the list of control flow tasks.

· Data Mining and Analysis Services Processing tasks

The Data Mining task runs an Analysis Services data mining query and saves the results to a table. The Analysis Services Processing task will launch processing on Analysis Services dimensions and databases. Use the Analysis Services DDL task to create new Analysis Services partitions, or perform any data definition language operation. There are Data Mining and Analysis Services Data Flow transforms, as well as these control flow tasks. Use the Analysis Services control flow tasks to fully or incrementally update your databases and models. The use of the corresponding Data Flow transforms is discussed in the next section.

· XML and Web Services tasks

The XML task retrieves XML documents and applies XML operations to them. Use the XML task to validate an XML document against its XSD schema, or to compare or merge two XML documents. Use the Web Services task to make calls to a web service.

· Message Queue, WMI Data Reader, and WMI Event Watcher tasks

The Message Queue task uses Microsoft Message Queue (MSMQ) to manage tasks in a distributed system. You can use the WMI tasks to coordinate with the Windows Management Interface, and automate the execution of a package or set of tasks when a specific system event has occurred.

· ForEach Loop, For Loop, and Sequence containers

Use containers like the ForEach Loop and For Loop to execute a set of tasks multiple times. For example, you can loop over all the tables in a database, performing a standard set of operations like updating index statistics.

The Sequence container groups together several tasks. Use it to define a transaction boundary around a set of tasks so they all fail or succeed together. Or, use it simply to reduce the clutter on the design surface by hiding the detailed steps within the sequence.

· Data Flow task
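To make the Bulk Insert and Execute SQL tasks referenced above more concrete, here is a minimal T-SQL sketch of the kinds of statements they issue. The staging table, file path, and load-log table are hypothetical placeholders, not objects defined in the book.

-- Hypothetical example of the statement behind a Bulk Insert task.
-- The staging table, file path, and delimiters are placeholders.
BULK INSERT dbo.Stg_OrderLineItem
FROM 'D:\ETL\Incoming\OrderLineItem.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', FIRSTROW = 2, TABLOCK);

-- Hypothetical example for an Execute SQL task: fetch the end time of the last
-- successful load as a single-row result set and map it to a package variable.
SELECT MAX(LoadEndDateTime) AS LastLoadDateTime
FROM dbo.ETL_LoadLog
WHERE TableName = 'FactOrderLineItem'
  AND LoadStatus = 'Succeeded';

The variable populated by the second query would typically bound the incremental extract later in the package.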

 

3. Data Flow

A data flow task contains one or more data sources, zero or more transformation steps, and zero or more data destinations. The simplest useful data flow task would have one source and one destination: It would copy data from one place to another. Most data flow tasks are a lot more complicated than that, with multiple sources, transformations, and even destinations.

A. Data Sources

· Use the OLE DB Source adapter to extract data from any source that has an OLE DB provider, including SQL Server, Oracle, and DB2.

· Use the DataReader Source adapter to extract data from a .NET Provider like ADONET:SQL or ADONET:ORACLE. If performance is vital, use the DataReader source adapter, which has slightly better performance than the OLE DB adapter, but only the most basic user interface.

· The Flat File Source adapter pulls data from a flat file. You could use an OLE DB or ODBC flat file driver, but we prefer the Flat File Source because it has a better UI and handles bad data more flexibly.

· The Raw File Source adapter pulls data from a raw data file, which is a format unique to Integration Services.

Raw files are read without any translation or parsing. You cannot define an error flow on this source, and it cannot share connections.

· The XML Source adapter pulls data from an XML source. The XML source can be either in a file or in a package variable.

· You can define a custom source adapter to consume data from a unique source, like a file whose rows have different formats under different conditions.

B. Data Destinations

· You can use the SQL Server destination adapter to load data into the SQL Server relational database, instead of the OLE DB adapter. There are several differences between the two adapters:

o        The SQL Server adapter can write data only to a SQL Server relational database running on the same machine as Integration Services.

o        The SQL Server adapter requires that data be formatted perfectly; it skips some data checking and conversion steps that the OLE DB adapter performs for you.

o        The SQL Server adapter can perform faster than the OLE DB adapter because it skips those data checking and conversion steps.

o        Unlike the OLE DB adapter, the SQL Server adapter doesn’t let you set a batch size or commit size for bulk loading. In some scenarios, setting these parameters optimally enables the OLE DB adapter to out-perform the SQL Server adapter.

· The OLE DB destination adapter will load data into any OLE DB target database or table, like SQL Server, Oracle, DB2, or even Excel.

· The Flat File and Raw File destination adapters write the data stream to the file system.

· Use the Data Reader destination adapter to generate a Reporting Services report directly from an Integration Services package.

· The Recordset destination adapter populates an ADODB recordset object held in memory.

· The SQL Mobile destination adapter is appropriate for pushing data down to SQL Server Mobile Edition databases.

· The Dimension and Partition Processing destination adapters push the pipeline data into an Analysis Services dimension or fact partition, without the usual intermediate step of writing the data to a relational database.

C. Data Transformations

· Sort and Aggregate transforms perform high-performance sorting and aggregation.

· Conditional Split and Multicast transforms create multiple output streams from one data stream.

The Multicast transform efficiently replicates a stream, perhaps for parallel processing. The Conditional Split transform sends each input row to one and only one output stream; Multicast sends each input row to each output stream.

· The Union All, Merge, Merge Join, and Lookup transforms all join multiple data streams.

Use Union All to combine multiple streams with similar structures, perhaps from an upstream Conditional Split, or for Customer records from multiple source systems.

Merge is very similar to Union All, but it interleaves sorted rows. If you don’t care about the order of the rows in the output stream, Union All is both more forgiving and more efficient.

Merge Join is a lot like a database join: it merges rows based on columns in common, and can perform a left, full, or inner join.

The Lookup transform is very useful for decoding a code from the source data into a friendly label using information stored in a database table.

· The Character Map, Copy/Map, Data Conversion, and Derived Column transforms all do basic transformations.

Character Map works only on string data, and will perform transformations like changing case, and changing character width for international strings. Copy/Map simply creates a second copy of an existing column in the stream. Data Conversion changes the data type of one or more columns, for example translating a date string to a true date. Finally, the Derived Column transform is where the most interesting transformation logic occurs.

· The Slowly Changing Dimension transform

A wizard walks you through configuring how slowly changing dimension attributes are handled.

· Use the OLE DB Command transform to update or delete rows in a target table, based on the data in the data flow.

The OLE DB Command is the only way to execute an UPDATE statement from within the Data Flow (see the T-SQL sketch after this list).

· The Row Count and Audit transforms are most useful for developing process metadata about the data flow.

Use the Row Count transform to efficiently count the rows that will be loaded to the target, or that flow into the error handler. Audit captures information about the system environment at the time the data flow task is run, including the computer name, user id, and execution start time.

· You can use the Percentage Sampling and Row Sampling transforms during package testing, to operate against a subset of the source data.

· The Pivot and Unpivot transforms reshape the data stream, turning rows into columns or columns into rows.

· Use the Data Mining Model Training and Data Mining Query transforms to incorporate Analysis Services data mining technology into your ETL application.

· Fuzzy Grouping and Fuzzy Lookup transforms employ fuzzy logic algorithms that were developed by Microsoft Research.

These provide fuzzy (approximate) matching.

· Term Extraction and Term Lookup transforms employ text-mining algorithms that were also developed by Microsoft Research.

· The File Extractor and File Injector transforms are used primarily to strip out (extract) text and image data from a data flow and put it into a file or files, or to add such data from files into the data flow.

· The Script component provides a simple mechanism for creating a custom transformation.
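The T-SQL sketch referenced above for the OLE DB Command transform: a parameterized UPDATE that the transform runs once for each row flowing through it. The dimension table and columns are hypothetical, chosen only to show type 1 (overwrite) changes.

-- Hypothetical parameterized UPDATE issued by an OLE DB Command transform.
-- Each ? placeholder is mapped to a column of the data flow in the transform's editor.
UPDATE dbo.DimCustomer
SET    EmailAddress = ?,
       Phone        = ?
WHERE  CustomerKey  = ?;

Because the statement fires row by row, it can be slow for large volumes; a common alternative is to land the changed rows in a staging table and run one set-based UPDATE afterward.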

D. Error Flows

The error flow is a data flow just like the normal flow. You can transform the data to fix the error, and hook it back up with the normal flow. You can write the error flow to a table or file.

Warning 

By default, all steps in the data flow are set up to fail if an error is encountered. To make use of error flows, you’ll need to modify this default behavior by setting up an error flow and changing the error flow characteristics of the transform.

 

4. Concepts for Dynamic Packages

Integration Services implements a rich expression language that is used in the control flow and in data flow transforms. The results of expressions are often placed into variables, which can be shared throughout a network of packages.

A.     Expressions

SSIS includes its own expression language.

B.     Variables

You can define variables within your package. Variables can be scoped to any object: package-wide, within a container like a sequence, or as narrowly as a single task.

User-defined variables are the most interesting, but there are a ton of system variables available for you to examine and, possibly, log to your metadata tables.
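As an illustration, here is a sketch of a hypothetical process-metadata table and the parameterized INSERT an Execute SQL task might run against it, with the parameters mapped to system variables such as System::PackageName, System::MachineName, System::UserName, and System::StartTime. The table design is an assumption, not something prescribed by the book.

-- Hypothetical process-metadata table for package executions.
CREATE TABLE dbo.ETL_PackageRunLog (
    PackageRunID int IDENTITY(1,1) PRIMARY KEY,
    PackageName  nvarchar(128) NOT NULL,
    MachineName  nvarchar(128) NOT NULL,
    UserName     nvarchar(128) NOT NULL,
    StartTime    datetime      NOT NULL,
    LoggedAt     datetime      NOT NULL DEFAULT (GETDATE())
);

-- Parameterized INSERT for an Execute SQL task; map the ? parameters to
-- System::PackageName, System::MachineName, System::UserName, System::StartTime.
INSERT dbo.ETL_PackageRunLog (PackageName, MachineName, UserName, StartTime)
VALUES (?, ?, ?, ?);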

C.     Configurations

You can overwrite most of the settings for Integration Services objects by supplying a configuration file at runtime. Variables’ initial values can also be set from a configuration file.

Integration Services can load configurations from various sources: SQL Server, an XML file, a parent package variable, or a Windows environment variable or registry entry. XML configuration files are easily managed in your source control system.

Use the Configuration Wizard to create the basic structure of a configuration file. Launch that wizard by choosing the SSIS Package Configurations menu item in BI Studio.

5. Event Handlers

Write an OnPostExecute event handler to clean up temporary storage and other settings when a package or task finishes. Write a package OnPreExecute event handler to check whether there’s enough disk space available for the package to run. Write an OnError event handler to restore the data warehouse database to a consistent state, and notify an operator of the problem.

Events percolate up until they find an event handler. For example, if you define an OnProgress event handler at the package level, the individual tasks in that package will use that same event handler unless they have an OnProgress event handler of their own.

 

III. High-Level Design

A summary of the high-level planning process includes the following items:

·         Develop a high-level map.

·         If helpful, build a sandbox source system.

·         Perform detailed data profiling and complete the source-to-target mapping.

·         Determine how often you’ll load each table.

·         Determine how much historical data you’ll load for each table.

·         Develop a strategy for partitioning the relational and Analysis Services fact tables.

·         Design a strategy for extracting data from each source system.

·         If necessary, de-duplicate key data elements like person and organization.

·         If necessary, develop a strategy for distributing dimension tables across multiple database servers.

1. Develop the First Draft High-Level Map

The high-level design should stay consistent with the dimensional model. It should show which source systems feed each table in the model, and the dependencies between tables; for example, dimension tables must finish loading before the fact tables that reference them.

Figure 5.4: High-level map for Customer (both individual and reseller)

The map above conveys, for each table: the source tables, the extract strategy, the load frequency, the estimated row count for the initial load, the estimated rows per day, and the dimension change logic.

The black circles mark data quality checkpoints.

2. Build a Sandbox Source System

The DW/BI team usually has access only to a mirror of the source system database.

If you don’t have the disk space to restore the entire source database and then prune away the unwanted data, instead create a consistent subset using a process like the following (a T-SQL sketch of steps 4 and 8 follows the list):

1)      In Management Studio, script the creation of all relevant table and view objects in the source database, including keys and indexes. You probably do not want to script triggers.

2)      Create the sandbox database and run the script to instantiate empty versions of all the objects.

3)      In Management Studio, use the Import/Export Data Wizard to quickly generate an Integration Services package that will load data from each table in the source database to its corresponding table in the sandbox database. Be sure to omit moving data for the very large tables.

4)      Enable identity insert if any column uses the IDENTITY keyword.

5)      For the handful of very large tables, develop a custom Integration Services package that pulls over only the desired subset of data.

6)      For the large tables, drop indexes and constraints so the load can proceed as fast as possible. Be sure to restore constraints and indexes later. If you don’t restore constraints you may spend a lot of time chasing down referential integrity violations that exist only in your sandbox.

7)      Check your work by generating a CREATE script for both the source and sandbox databases. Use windiff or your favorite tool to compare the two scripts. This will ensure all keys, indexes, and constraints are the same.

8)      Further check your work by comparing table sizes and rowcounts.
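A minimal T-SQL sketch of steps 4 and 8 above, using a hypothetical Customer table; the database, table, and column names are placeholders.

-- Step 4: allow explicit values in the IDENTITY column while copying this table.
-- (IDENTITY_INSERT can be ON for only one table at a time per session.)
SET IDENTITY_INSERT SandboxDB.dbo.Customer ON;

INSERT INTO SandboxDB.dbo.Customer (CustomerID, CustomerName, CreatedDate)
SELECT CustomerID, CustomerName, CreatedDate
FROM   SourceDB.dbo.Customer;

SET IDENTITY_INSERT SandboxDB.dbo.Customer OFF;

-- Step 8: compare row counts between the source and the sandbox.
SELECT 'Customer' AS TableName,
       (SELECT COUNT(*) FROM SourceDB.dbo.Customer)  AS SourceRows,
       (SELECT COUNT(*) FROM SandboxDB.dbo.Customer) AS SandboxRows;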

3. Perform Data Profiling

This is mostly done with a data profiling tool, to uncover quality problems in the source data so that the ETL design can plan how to handle bad data. A basic profiling query sketch follows.
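If no dedicated profiling tool is available, simple queries cover the basics. A sketch against a hypothetical source table and column, reporting the minimum, maximum, distinct count, NULL incidence, and total rows:

-- Basic column profile for a hypothetical source table and column.
SELECT MIN(CreditLimit)                                     AS MinValue,
       MAX(CreditLimit)                                     AS MaxValue,
       COUNT(DISTINCT CreditLimit)                          AS DistinctValues,
       SUM(CASE WHEN CreditLimit IS NULL THEN 1 ELSE 0 END) AS NullCount,
       COUNT(*)                                             AS TotalRows
FROM   SourceDB.dbo.Customer;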

4. Complete the Source-to-Target Mapping

The following kinds of data transformations should be annotated in the source-to-target mapping document (a T-SQL sketch of equivalent expressions follows the list):

·         Parsing a source column into multiple target columns: for example, breaking out area code from phone number

·         Combining multiple source columns into a single target column: for example, constructing a customer’s full name

·         Correcting any data errors discovered during data profiling

·         Changing case: for example, changing a name field from all capitals to proper case
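The sketch referenced above: the same transformations written as T-SQL expressions, against hypothetical source columns. In the ETL system itself these would typically be Derived Column expressions instead.

-- Hypothetical T-SQL equivalents of the transformations listed above.
SELECT
    -- Parse one source column into several target columns: area code from a
    -- phone number assumed to look like '(206) 555-0199'.
    SUBSTRING(Phone, 2, 3)                                        AS AreaCode,
    -- Combine multiple source columns into one target column: full name.
    FirstName + ' ' + LastName                                    AS FullName,
    -- Change case: an all-capitals surname to a crude proper case.
    UPPER(LEFT(LastName, 1)) + LOWER(SUBSTRING(LastName, 2, 100)) AS LastNameProperCase
FROM SourceDB.dbo.Customer;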

5. Load Frequency

Most customers require at least daily loads.

6. How Much Historical Data

1) How much data to keep in the data warehouse, and how often to archive or move older data out.

2) How much data the initial (historical) load must bring in.

7. Use Partitioning

8. Historical and Incremental Loads

The strategy for the historical load usually differs from the strategy for incremental loads, so the ETL design should be parameterized to handle both. A sketch of a date-parameterized extract follows.
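One simple way to parameterize the two modes is a single extract query whose date range comes from package variables: the historical run passes a wide range, while the daily incremental run passes only the latest load window. Table and variable names are hypothetical.

-- Date-parameterized extract for an OLE DB Source or Execute SQL task.
-- Map the two ? parameters to package variables, e.g. User::ExtractStartDate
-- and User::ExtractEndDate.
SELECT OrderID, OrderDate, CustomerID, ProductID, Quantity, Amount
FROM   SourceDB.dbo.OrderLineItem
WHERE  OrderDate >= ?
  AND  OrderDate <  ?;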

9. Data Extraction Strategy

A.   Extracting Data from Packaged Source Systems

Packaged source systems such as SAP usually require extraction through their own custom APIs.

SQL Server 2005 ships with a .NET provider for SAP, which provides two methods of accessing SAP R/3 data:

·         Write a SQL statement against a single SAP table. You can access column data from all types of SAP tables.

·         Execute BAPI/RFC. With this access method you can execute a BAPI or custom RFC in the SAP server.

B.   Extracting Directly from the Source Databases

Best practice says to keep the source queries as simple as possible, and explicitly perform transformations in the Integration Services data flow.

If you’re worried that your historical extract will not finish before the next ice age, consider dumping the large historical tables to flat files. Integration Services’ flat file source adapter performs very well.

 

C.   Extracting Data for Incremental Loads

If data volumes are large, you should work with the source system DBAs to craft a solution. Alternative approaches include using change data capture tools on the source system, adding triggers to the source system tables, or defining transactional replication on the source system. A sketch of the trigger approach follows.
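The trigger approach mentioned above, sketched in T-SQL: a change-log table records the keys of changed rows, and the ETL system extracts only the keys logged since its last run, joining back to the source table for current column values. The table, trigger, and column names are hypothetical, and the source system DBAs would need to weigh the trigger's overhead.

-- Hypothetical change-log table on the source system.
CREATE TABLE dbo.Customer_ChangeLog (
    CustomerID int      NOT NULL,
    ChangeType char(1)  NOT NULL,                 -- 'I', 'U', or 'D'
    ChangedAt  datetime NOT NULL DEFAULT (GETDATE())
);
GO

-- Log the keys of inserted, updated, and deleted rows.
CREATE TRIGGER trg_Customer_Audit
ON dbo.Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Inserts and updates: log every row in the inserted pseudo-table.
    INSERT dbo.Customer_ChangeLog (CustomerID, ChangeType)
    SELECT i.CustomerID,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
    FROM   inserted i;

    -- Deletes: rows present in deleted but not in inserted.
    INSERT dbo.Customer_ChangeLog (CustomerID, ChangeType)
    SELECT d.CustomerID, 'D'
    FROM   deleted d
    WHERE  NOT EXISTS (SELECT 1 FROM inserted i WHERE i.CustomerID = d.CustomerID);
END;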

 

Most dimensions are small enough that it’s perfectly fine to pull all the dimension data into the ETL system on each incremental load and compare with the existing dimension. The Integration Services Slowly Changing Dimension Data Flow transform gracefully handles the full table comparisons for you and updates only dimension rows that have actually changed.

D.   Extracting Historical Data

10. De-duplication

Integration Services includes two general-purpose transforms that help you address data quality and de-duplication: Fuzzy Lookup and Fuzzy Grouping. These transforms do a surprisingly good job at identifying similar data and assigning a probability to the match’s correctness. However, these transforms work best if the data has been standardized first.

Third-party data cleansing tools can also be used.

11. Dimension Distribution Strategy (Distributed DW Systems)

Consider distributing dimension tables across multiple machines, with different business process dimensional models on different physical servers. If you must distribute the system horizontally, because of data volumes or political pressures, you’ll need to build a single ETL system that processes all dimensions. This system, called the dimension manager, is simply a set of Integration Services packages that manages the processing for all of your organization’s dimension tables. Distributed databases use local copies of the master dimensions provided by the dimension manager. By isolating the processing of dimensions and facts as we described in this chapter, you’ll build a more solid, flexible, and scalable ETL system, even if you don’t need to distribute dimensions.

 

 

Warning 

To copy the entire dimension over to the distributed database, you need either to have no foreign key constraints defined between the fact table and dimension tables, or you need to disable and re-enable those constraints. For large dimensions, use replication to propagate only the changes.
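A sketch of the constraint handling just described, assuming a hypothetical foreign key name and a local dimension copy whose surrogate key column is not an IDENTITY column (the dimension manager assigns the keys).

-- Disable the fact-to-dimension foreign key before replacing the local dimension copy.
ALTER TABLE dbo.FactOrderLineItem NOCHECK CONSTRAINT FK_FactOrderLineItem_DimCustomer;

-- Full refresh of the local copy from the dimension manager's master copy.
-- (For large dimensions, replicate only the changes instead, as noted above.)
DELETE FROM dbo.DimCustomer;
INSERT INTO dbo.DimCustomer
SELECT * FROM DimensionManagerDB.dbo.DimCustomer;

-- Re-enable the constraint; WITH CHECK re-validates existing rows so it remains trusted.
ALTER TABLE dbo.FactOrderLineItem WITH CHECK CHECK CONSTRAINT FK_FactOrderLineItem_DimCustomer;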

Distributing the system horizontally is not recommended because it is too complex. Prefer vertical distribution instead: place the relational database, SSAS, and SSRS on separate machines.

IV. Updating the Analysis Services Database

The book recommends using Analysis Services as the primary presentation layer of the DW/BI system (there is no separate data mart concept).

The standard method of updating the Analysis Services database is to use the Analysis Services Processing task in Integration Services. You could add this task to the end of each package’s control flow.

An alternative, and usually better, approach is to create one package to perform all Analysis Services processing. Execute this package as the final step in the master package’s control flow.

V. ETL Physical Design

1. System Architecture and Integration Services

Integration Services has two major components: a design environment that’s part of the BI Studio, and a runtime environment. The design environment is where you create, edit, and debug packages. The only way to remotely execute an Integration Services package in development/debugging mode is to use a remote desktop connection to the remote machine.

On test or production systems, use the DTExecUI or DTExec utility to execute packages. You can run on one server a package that is stored on a different server. In production, Integration Services packages can be located anywhere, and can be run on any server that has the Integration Services runtime executables.

2. Staging Area

Some Integration Services ETL systems will not use a staging area at all. They will kick off a package at some predetermined time, often in the middle of the night. The package pulls data from the source systems, performs all necessary transformations, populates the relational data warehouse database, and launches Analysis Services processing.

Other systems may execute a process on the source system to extract data to files, and then use Integration Services to move those files to the ETL server for processing. In this case the staging area would consist of some file space on both the source and ETL servers.

Another common architecture will be to design a set of Integration Services packages to perform extracts, and a second set of packages to perform transformations and loads. You might choose this architecture if you want to extract data more frequently than you want to transform and load it. In this case, the extract packages write the data as flat files or raw files (an Integration Services-specific format), and the staging area would consist of file space on the ETL server.

Fewer new ETL systems will use a relational staging area than previously. The flat file and raw file storage is efficient and relatively easy to manage. However, a staging relational database is a concept that’s comfortable and familiar to many ETL architects. There’s nothing wrong with using a database rather than files, if your load window can handle the overhead.

3.  Package Storage

An Integration Services package is an XML file, and has the file extension .dtsx.

When you install a set of packages on the target system, you can store the packages in the file system or in the msdb database in an instance of SQL Server 2005. We prefer to store production packages in the file system Package Store. The Package Store is a special file system folder in which Integration Services can find packages.

4.  Package Naming Conventions

Our convention is to start with the table name, and append an acronym for the major sets of processing that the package handles. Some examples follow:

·         DimCustomer_ETLPi performs incremental extraction, transformation, load, and Analysis Services dimension processing (P) for the customer dimension (DimCustomer table).

·         DimCustomer_Eh performs historical extraction only for the customer dimension.

·         DimCustomer_TLh performs historical transformation and loading for the customer dimension.

·         FactOrderLineItem_ETLPi performs incremental extraction, transformation, loading, and Analysis Services partition processing for the Orders fact table.

·         Master_Orders_i runs subpackages for the incremental processing of any dimension tables used by the Orders business process, and then runs subpackages for the incremental processing of the one or more related fact tables in that dimensional model.

 

VI. Developing the Detailed ETL Specification

Document all the decisions we have discussed in this chapter, including:

·         The default strategy for a package error handler

·         The default strategy for extracting from each major source system

·         The default approach for handling partitioning

·         The design of the dimension manager

·         Locations of staging areas

·         Data model of the ETL process metadata

·         Requirements for system availability, and the basic approach to meeting those requirements

A good specification will include between two and ten pages of detail for each table, and document the following information and decisions:

·         Table design (column names, data types, keys, and constraints).

·         How much historical data to load (for example, 37 months).

·         Historical data volumes (row count).

·         Incremental data volumes, measured as new and updated rows/load cycle.

·         How to handle late-arriving data for facts and dimensions.

·         Load frequency, like daily.

·         How changes in each dimension attribute will be handled (like Type 1, 2, or 3).

·         Fact table and associated Analysis Services partitioning strategy. Discuss how to implement partitioning if today’s load can include fact data for aged partitions. (A relational partitioning DDL sketch appears after this list.)

·         Overview of data sources, including a discussion of any unusual characteristics of the sources, like an unusually short access window.

·         Detailed source-to-target mapping.

·         Source data profiling, including at least the minimum and maximum values for each numeric column, the count of distinct values in each column, and the incidence of NULLs.

·         Extract strategy for the source data (source system APIs, direct query from database, dump to flat files).

·         Dependencies: Which other tables need to be loaded before this table is processed?

·         Document the transformation logic. It’s easiest to write this section as pseudo-code, rather than trying to craft complete sentences. The more familiar you are with Integration Services, the more you can use shorthand like Conditional split on ColA>1000.

·         Preconditions to avoid error conditions: for example, should the package check for file or database space before proceeding?

·         Cleanup steps, for example deleting working files.

·         Package error handler, if it differs from the default described previously.

·         The packages you will create to handle the table’s processing, and the scope of work for each package.

·         An estimate of whether this portion of the ETL system will be easy, medium-difficulty, or difficult to implement.
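The relational partitioning DDL referenced in the partitioning bullet above, sketched for a hypothetical monthly-partitioned fact table (SQL Server 2005 syntax; table, function, and scheme names are placeholders).

-- Monthly partition function and scheme for a hypothetical Orders fact table.
CREATE PARTITION FUNCTION pfOrderDate (datetime)
AS RANGE RIGHT FOR VALUES ('2005-01-01', '2005-02-01', '2005-03-01');

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

-- Create the fact table on the partition scheme, partitioned by order date.
CREATE TABLE dbo.FactOrderLineItem (
    OrderDate   datetime NOT NULL,
    CustomerKey int      NOT NULL,
    ProductKey  int      NOT NULL,
    Quantity    int      NOT NULL,
    Amount      money    NOT NULL
) ON psOrderDate (OrderDate);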

The final section of the ETL specification describes any master packages, and provides a first cut at job sequencing. Create a dependency tree that specifies which tables must be processed before others. Whether or not you choose to parallelize your processing, it’s important to know the logical dependencies that cannot be broken.

 