1. ETL Tool
In my project, the ETL tool was implemented internally.
2. Which change data capture mechanism should we choose?
Most data warehouse tables are so large that they cannot be fully refreshed during every ETL cycle. You need the capability to transfer only the changes made to the source data since the last update. Isolating the latest source data is called change data capture (CDC). The idea behind change data capture is simple enough: just transfer the data that has changed since the last load.
Commonly used strategies include:
- Timestamps: the source system must record the last-update time of each row; extraction is then driven by that update time.
- Log parsing: parse the database system's log files to obtain the operations performed on the database that day, thereby capturing the changed data.
- Full comparison: compare the current day's data against the previous day's copy in full; depending on business needs, this can be narrowed to a subset of columns or done as a hash comparison.
- Source-system recording: create triggers in the source system, or have the application itself record the changed data directly. This approach is simple, but it depends heavily on the source system and is therefore hard to put into practice.
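The timestamp strategy above can be sketched in a few lines. This is a minimal illustration, assuming the source table keeps a `last_updated` column; the table, column, and variable names are hypothetical, not from any particular system.

```python
import sqlite3

# Build a tiny in-memory source table with a last_updated column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_updated TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01 08:00:00"),
     (2, 25.0, "2024-01-02 09:30:00"),
     (3, 40.0, "2024-01-02 11:15:00")],
)

def extract_changes(conn, last_load_time):
    """Pull only the rows touched since the previous ETL run."""
    cur = conn.execute(
        "SELECT id, amount, last_updated FROM orders WHERE last_updated > ?",
        (last_load_time,),
    )
    return cur.fetchall()

# Only the two rows updated after the previous load are extracted.
changed = extract_changes(conn, "2024-01-01 23:59:59")
```

The same `WHERE last_updated > ?` predicate is the core of most timestamp-based CDC extracts; everything else is bookkeeping around remembering the previous load time.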
3. When should we stage the data?
In today's data warehousing environment, it's quite possible for ETL tools to establish a direct connection to the source database, extract and stream the data through the ETL tool to apply any required transformation in memory, and finally write it, only once, into the target data warehouse table. From a performance viewpoint, this is a great capability as writes, especially logged writes into the RDBMS, are very expensive; it's a good design goal to minimize them. However, despite the performance advantages, this may not be the best approach. There are several reasons an organization might decide to physically stage the data (i.e., write it to disk) during the ETL process:
- The most appropriate CDC method requires a compare of the current copy of the source table to the prior copy of the same table.
- The organization has elected to stage the data immediately after extract for archival purposes -- possibly to meet compliance and audit requirements.
- A recovery/restart point is desired in the event the ETL job fails in midstream -- potentially due to a break in the connection between the source and ETL environment.
- Long-running ETL processes may hold a connection to the source system open, creating problems with database locks and stressing the transaction system.
4. Where should we correct data?
Business users are aware that data quality is a serious and expensive problem. Thus, most organizations are likely to support initiatives to improve data quality. But most users probably have no idea where data quality problems originate or what should be done to improve data quality. They may think that data quality is a simple execution problem for the ETL team, but data quality cannot be improved by ETL alone. In this environment, the ETL team needs to be agile and proactive, partnering with the business and the IT teams that support the source systems.
The key decision is where to correct the data. The best solution is clearly to have the data captured accurately in the first place; in most cases the data should be corrected back in the source system. Unfortunately, it is inevitable that poor-quality data will reach the ETL system, and when this happens, there are three choices:
1) Halt the entire load process
2) Send the offending record(s) to a suspense file for later processing
3) Tag the data and pass it through
The third choice is by far the best choice, whenever possible. Halting the process is obviously a pain because it requires manual intervention to diagnose the problem, restart or resume the job, or abort completely. Sending records to a suspense file is often a poor solution because it is not clear when or if these records will be fixed and reintroduced to the pipeline. Until the records are restored to the data flow, the overall integrity of the database is questionable because records are missing. I don't recommend using the suspense file for minor data transgressions. The third option of tagging the data with the error condition often works well. Bad fact table data can be tagged with an audit dimension that describes the overall data quality condition of the offending fact row. Bad dimension data can also be tagged using the audit dimension or, in the case of missing or garbage data, it can be tagged with unique error values in the field itself.
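The tagging option can be sketched as follows. This is a minimal illustration, not a fixed standard: the audit codes, the `audit_quality` field, and the `"Unknown"` placeholder member are all hypothetical conventions chosen for the example.

```python
# Placeholder member used when dimension data is missing or garbage.
UNKNOWN_MEMBER = "Unknown"

def tag_fact_row(row):
    """Attach an audit flag describing the row's data quality, then pass
    the row through rather than halting the load or suspending it."""
    issues = []
    if row.get("amount") is None or row["amount"] < 0:
        issues.append("bad_amount")
    if not row.get("customer"):
        # Substitute a unique error value in the field itself.
        row["customer"] = UNKNOWN_MEMBER
        issues.append("missing_customer")
    row["audit_quality"] = ",".join(issues) if issues else "clean"
    return row

tagged = [tag_fact_row(r) for r in [
    {"amount": 10.0, "customer": "ACME"},
    {"amount": -5.0, "customer": None},
]]
```

Because every row, good or bad, carries an explicit quality flag, analysts can include or exclude suspect facts at query time instead of discovering missing records later.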
5. What data latency level do we need to support?
Load frequency is how far the ETL system lags behind the source system before loading data. This factor should be weighed against cost and business requirements, because load frequency has a large impact on the cost and complexity of the ETL architecture. The right strategy depends on the business: some applications genuinely require real-time analysis, while many others do not. Three common loading strategies are:
- Real-time loading
- Multiple loads per day
- One load per day
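The three strategies differ only in the load interval, which is the knob that sets data latency. The mapping below is an illustrative sketch; the specific intervals are assumptions for the example, not recommendations.

```python
# Hypothetical interval per latency strategy, in seconds.
INTERVAL_SECONDS = {
    "real_time": 60,        # micro-batches roughly every minute
    "intraday": 6 * 3600,   # several loads per day
    "daily": 24 * 3600,     # one nightly batch
}

def loads_per_day(strategy):
    """How many ETL cycles per day a given latency strategy implies."""
    return 86400 // INTERVAL_SECONDS[strategy]
```

The cycle count is a rough proxy for cost and complexity: more cycles mean more scheduling, more source connections, and more opportunities for midstream failure, which is why the latency requirement should come from the business rather than default to "as fresh as possible".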
Source: ITPUB blog, http://blog.itpub.net/12374161/viewspace-667965/. Please credit the source when reprinting.