SSIS --Slowly Changing Dimension

Surrogate Keys

Also known as meaningless keys, substitute keys, non-natural keys, or artificial keys. A surrogate key is a unique value, usually an integer, assigned to each row in the dimension. This surrogate key becomes the primary key of the dimension table and is used to join the dimension to the associated foreign key field in the fact table.

It is useful because the natural primary key (i.e. Customer Number in Customer table) can change and this makes updates more difficult. 
Example:

On Jan 1 2010, Emp A belongs to Dept1, whatever sales made by this employee added to Dept1 but on June 1 2010 Emp A moved to Dept2. All his new sales contribution should be added to Dept2 from that day onwards and the old one should belong to the Dept2.

If let's say in this case we have used business key (Primary key as stated in RDBMS) within data warehouse everything would be allocated to Dept2 even what actually belongs to Dept1

If you use surrogate keys you could create on the 1st June a new record for the Employee 'A' in your Employee Dimension with a new surrogate key. 
This way in your fact table you have your old data (before June) with the SID of the Employee 'E1' + 'Dept1' All new data (after June) would take the SID of the employee 'E1' + 'Dept2' 
Key Points:

A surrogate key is a unique value, usually an integer, assigned to each row in the dimension. This surrogate key becomes the primary key of the dimension table and is used to join the dimension to the associated foreign key field in the fact table.

The ability to track changes in dimension attributes over time is reason enough to implement surrogate keys.

Handling Slowly Changing Dimension through SSIS:

SCD and its types:

Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes.

For example, you may have a Dimension in your database that tracks the sales records of your company's salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another or from one manager to another manager. How do you record such a change in your sales Dimension?

You could sum or average the sales by salesperson, but if you use that to compare the performance of salesmen, that might give misleading information. If the salesperson that was transferred used to work in a hot market where sales were easy, and now works in a market where sales are infrequent, her totals will look much stronger than the other salespeople in her new region, even if they are just as good. Or you could create a second salesperson record and treat the transferred person as a new sales person, but that creates problems also.

Dealing with these issues involves SCD management methodologies

The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming you won't ever need to know how it used to be misspelled in the past)

Another example would be of a database table that keeps supplier information.

clip_image002[4]

Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:

clip_image003[4]

The obvious disadvantage to this method of managing SCDs is that there is no historical record kept in the data warehouse. You can't tell if your suppliers are tending to move to the Midwest, for example. But an advantage to this is that these are very easy to maintain. Type 2

The Type 2 method tracks historical data by creating multiple records in the dimensional tables with separate keys. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.

In the same example, if the supplier moves to Illinois, the table would look like this:

clip_image004[4]

Another popular method for tuple versioning is to add effective date columns.

clip_image005[4]

Null End_Date signifies current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index.

Transactions that reference this Surrogate Key (Supplier_Key) are then permanently bound to these time slices defined by each row in the slowly changing dimension table. If there are retrospective changes made to the contents of the dimension, or if a new set of attributes are added to the dimension (for example a Sales Rep column) which have different effective dates to those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCD are not a good choice if the dimensional model is subject to change.

The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns we designate for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 will add additional columns to the tables:

clip_image006[4]

Note that this record cannot track all historical changes, such as when a supplier moves twice. Although, one possible version of this type is to create the field Previous_Supplier_State instead of Original_Supplier_State which will then track all historical changes.

Note: Type 3, keeps separate columns for both the old and new attribute values—sometimes called “alternate realities.” In our experience, Type 3 is less common because it involves changing the physical tables and is not very scalable.

SSIS has built in support for SCD (Sample Database is Attached)

Easy to get up and running

This is how Slowly Changing Dimension Wizard - Data Flow Looks like

clip_image008[4]

Demo of Using SCD through SSIS:

Source Table:

select * from scd.dbo.SourceEmployees

clip_image010[4]

Destination Table:

clip_image012[4]

Execute Package for the first time will give this, which is nothing but a copy of Source Table

clip_image013[4]

Now let us fire following statement.

update scd.dbo.SourceEmployees

set managerid=1002

where empname='G'

Execute Package and now you will find the difference, if you will notice here we get one more record for EmpBusinessID 1007 with EmpSurrogateID as 11 and EmpDate as Null and for old record EndDate inserted which signifies that this record is valid till 14th Sep 2010clip_image014[4]

Step of Implementing SCD from SSIS (Applies to SQL 2005 / 2008 and 2008 R2)

1) Create a New package.

2) Got to Data Tab and drop one OLE DB Source & Configure it for database ->Source Table

clip_image015[4]

clip_image016[4]

3) From Toolbox drag & drop in Data Flow, connect OLE DB Source created in step 2 to SCD

clip_image017[4]

4) Double Click on SCD will initiate SCD Wizard

5) Select a Dimension Table and Keys, here you need to specify the destination table, automatically mapping will be shown with matching col of source with destination

clip_image018[4]

6) Specify EmpBusinessID as Business Key & Click on next tab, this will allow you to specify the Change Type.

There are three kinds of Attributes:

Fixed Attribute - For which value is not changing

Changing Attribute - For which value will be over-written (Type 1)

Historical Attribute- For which value is saved as a new record. Previous values are saved in the records marked as outdated (Type 2)

clip_image019[4]

7) You can check or uncheck as per your requirement, I am un-checking these two

clip_image020[4]

8) Specify how you want record changed value, using single column or start / end date

clip_image021[4]

9) While loading data in Fact tables we usually see a scenario where the fact data is available but there is no corresponding business key in the related dimension.

In this case we choose multiple options to resolve the issue.

Ignore that fact

Insert the associated business key in dimension table and return the newly generated surrogate key from dimension table. And now store the data in Fact table with the surrogate key.

The second approach relates to a term called “Inferred members”. All the other attributes of that dimension will also be updated in next run of dimension load (usually nightly load).

This is al-together a lengthy topic for discussion and outside the scope of this session, for further information refers below mentioned links:-

SSIS approach to handle Inferred Members

Inferred Members Implementation Best Practices

For this demo, I am not checking Enable Inferred member support

clip_image022[4]

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/7899089/viewspace-732831/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/7899089/viewspace-732831/

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值