SSIS --Slowly Changing Dimension

最新推荐文章于 2024-10-05 14:16:38 发布

ckpyyin49663926

最新推荐文章于 2024-10-05 14:16:38 发布

阅读量149

点赞数

文章标签：数据库

Surrogate Keys

Also known as meaningless keys, substitute keys, non-natural keys, or artificial keys. A surrogate key is a unique value, usually an integer, assigned to each row in the dimension. This surrogate key becomes the primary key of the dimension table and is used to join the dimension to the associated foreign key field in the fact table.

It is useful because the natural primary key (i.e. Customer Number in Customer table) can change and this makes updates more difficult.
Example:

On Jan 1 2010, Emp A belongs to Dept1, whatever sales made by this employee added to Dept1 but on June 1 2010 Emp A moved to Dept2. All his new sales contribution should be added to Dept2 from that day onwards and the old one should belong to the Dept2.

If let's say in this case we have used business key (Primary key as stated in RDBMS) within data warehouse everything would be allocated to Dept2 even what actually belongs to Dept1

If you use surrogate keys you could create on the 1^st June a new record for the Employee 'A' in your Employee Dimension with a new surrogate key.
This way in your fact table you have your old data (before June) with the SID of the Employee 'E1' + 'Dept1' All new data (after June) would take the SID of the employee 'E1' + 'Dept2'
Key Points:

A surrogate key is a unique value, usually an integer, assigned to each row in the dimension. This surrogate key becomes the primary key of the dimension table and is used to join the dimension to the associated foreign key field in the fact table.

The ability to track changes in dimension attributes over time is reason enough to implement surrogate keys.

Handling Slowly Changing Dimension through SSIS:

SCD and its types:

Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes.

For example, you may have a Dimension in your database that tracks the sales records of your company's salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another or from one manager to another manager. How do you record such a change in your sales Dimension?

You could sum or average the sales by salesperson, but if you use that to compare the performance of salesmen, that might give misleading information. If the salesperson that was transferred used to work in a hot market where sales were easy, and now works in a market where sales are infrequent, her totals will look much stronger than the other salespeople in her new region, even if they are just as good. Or you could create a second salesperson record and treat the transferred person as a new sales person, but that creates problems also.

Dealing with these issues involves SCD management methodologies

The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming you won't ever need to know how it used to be misspelled in the past)

Another example would be of a database table that keeps supplier information.

Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:

The obvious disadvantage to this method of managing SCDs is that there is no historical record kept in the data warehouse. You can't tell if your suppliers are tending to move to the Midwest, for example. But an advantage to this is that these are very easy to maintain. Type 2

The Type 2 method tracks historical data by creating multiple records in the dimensional tables with separate keys. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.

In the same example, if the supplier moves to Illinois, the table would look like this:

Another popular method for tuple versioning is to add effective date columns.

Null End_Date signifies current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index.

Transactions that reference this Surrogate Key (Supplier_Key) are then permanently bound to these time slices defined by each row in the slowly changing dimension table. If there are retrospective changes made to the contents of the dimension, or if a new set of attributes are added to the dimension (for example a Sales Rep column) which have different effective dates to those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCD are not a good choice if the dimensional model is subject to change.

The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns we designate for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 will add additional columns to the tables:

Note that this record cannot track all historical changes, such as when a supplier moves twice. Although, one possible version of this type is to create the field Previous_Supplier_State instead of Original_Supplier_State which will then track all historical changes.

Note: Type 3, keeps separate columns for both the old and new attribute values—sometimes called “alternate realities.” In our experience, Type 3 is less common because it involves changing the physical tables and is not very scalable.

SSIS has built in support for SCD (Sample Database is Attached)

Easy to get up and running

This is how Slowly Changing Dimension Wizard - Data Flow Looks like

Demo of Using SCD through SSIS:

Source Table:

select * from scd.dbo.SourceEmployees

Destination Table:

Execute Package for the first time will give this, which is nothing but a copy of Source Table

Now let us fire following statement.

update scd.dbo.SourceEmployees

set managerid=1002

where empname='G'

Execute Package and now you will find the difference, if you will notice here we get one more record for EmpBusinessID 1007 with EmpSurrogateID as 11 and EmpDate as Null and for old record EndDate inserted which signifies that this record is valid till 14^th Sep 2010

Step of Implementing SCD from SSIS (Applies to SQL 2005 / 2008 and 2008 R2)

1) Create a New package.

2) Got to Data Tab and drop one OLE DB Source & Configure it for database ->Source Table

3) From Toolbox drag & drop in Data Flow, connect OLE DB Source created in step 2 to SCD

4) Double Click on SCD will initiate SCD Wizard

5) Select a Dimension Table and Keys, here you need to specify the destination table, automatically mapping will be shown with matching col of source with destination

6) Specify EmpBusinessID as Business Key & Click on next tab, this will allow you to specify the Change Type.

There are three kinds of Attributes:

Fixed Attribute - For which value is not changing

Changing Attribute - For which value will be over-written (Type 1)

Historical Attribute- For which value is saved as a new record. Previous values are saved in the records marked as outdated (Type 2)

7) You can check or uncheck as per your requirement, I am un-checking these two

8) Specify how you want record changed value, using single column or start / end date

9) While loading data in Fact tables we usually see a scenario where the fact data is available but there is no corresponding business key in the related dimension.

In this case we choose multiple options to resolve the issue.

Ignore that fact

Insert the associated business key in dimension table and return the newly generated surrogate key from dimension table. And now store the data in Fact table with the surrogate key.

The second approach relates to a term called “Inferred members”. All the other attributes of that dimension will also be updated in next run of dimension load (usually nightly load).

This is al-together a lengthy topic for discussion and outside the scope of this session, for further information refers below mentioned links:-

SSIS approach to handle Inferred Members

Inferred Members Implementation Best Practices

For this demo, I am not checking Enable Inferred member support