Implementing Slowly Changing Dimensions
Handling changes to dimensional data across time is one of the trickiest aspects of designing a data warehouse. Dimensional attributes rarely remain static. An individual's address can change, sales reps come and go, and companies introduce new products and phase out old ones. Changes to dimensional data can have far-ranging implications when you view them over time. For example, if a company reassigns a sales territory to a new sales representative, how can you record the change without making it appear as if the new sales rep had always held that territory? If a customer's name changes, how can you record the change and preserve the old version of the name? Designing a database that accurately and efficiently handles changes is a critical consideration when you're building a data warehouse. In this article, I'll define slowly changing dimensions, discuss the types of changes that a data warehouse needs to accommodate, and show you how to design and maintain a data warehouse that effectively handles change.
Slowly Changing Dimensions Defined
In a dimensional model, you're most interested in reporting on and analyzing facts and measures. However, facts and measures are relevant only when you define them in the context of their dimensions. Dimensions and their attributes, such as those in the examples above, are relatively constant, but they do change over time.
The term slowly changing dimensions refers to the variation in dimensional attributes over time. The word slowly in this context might seem incorrect. For example, a salesperson or store dimension, as Figure 1 shows, might change rapidly if a company reorganizes. But in general, when compared to a measure in a fact table, changes to dimensional data occur slowly.
The Types of Slowly Changing Dimensions
Within a data warehouse, you can handle slowly changing dimensions in several ways. These methods fall into various categories based on the company's need to preserve an accurate history of the dimensional changes. Ralph Kimball, author of The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses (John Wiley & Sons, 1996), categorized dimensional changes into three types—changes that overwrite history (Type One), preserve history (Type Two), and preserve a version of history (Type Three). The method you use to handle change depends on your company's need to maintain an accurate history of the dimension and the frequency of changes you expect. Here are definitions for each dimensional change type.
Overwriting history (Type One). A Type One change overwrites an existing dimensional attribute with new information. In the customer name-change example, the new name overwrites the old name, and the value for the old version is lost. A Type One change updates only the attribute, doesn't insert new records, and affects no keys.
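A Type One change reduces to a single in-place UPDATE. The sketch below uses Python's built-in sqlite3 module (standing in for SQL Server, purely for illustration) with a hypothetical CUSTOMER_DIMENSION table; the table and column names are illustrative, not taken from the article's figures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CUSTOMER_DIMENSION (
        CUSTOMER_KEY  INTEGER PRIMARY KEY,  -- meaningless surrogate key
        CUSTOMER_ID   TEXT,                 -- business key from the source
        CUSTOMER_NAME TEXT
    )""")
conn.execute("INSERT INTO CUSTOMER_DIMENSION VALUES (1, 'C100', 'Ann Jones')")

# Type One: overwrite the attribute in place. No new row is inserted, no
# keys change, and the old value ('Ann Jones') is lost for all fact records,
# past and future.
conn.execute(
    "UPDATE CUSTOMER_DIMENSION SET CUSTOMER_NAME = ? WHERE CUSTOMER_ID = ?",
    ("Ann Smith", "C100"),
)

row = conn.execute(
    "SELECT CUSTOMER_KEY, CUSTOMER_NAME FROM CUSTOMER_DIMENSION").fetchone()
# row is now (1, 'Ann Smith'); only one record exists for this customer.
```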
Preserving history (Type Two). A Type Two change writes a record with the new attribute information and preserves a record of the old dimensional data. Type Two changes let you preserve historical data. Implementing Type Two changes within a data warehouse might require significant analysis and development. Type Two changes partition history across time more accurately than the other types. However, because Type Two changes add records, they can significantly increase the database's size.
Preserving a version of history (Type Three). You usually implement Type Three changes only if you have a limited need to preserve and accurately describe history, such as when someone gets married and you need to retain the previous name. Instead of creating a new dimensional record to hold the attribute change, a Type Three change places a value for the change in the original dimensional record. You can create multiple fields to hold distinct values for separate points in time. In the case of a name change, you could create an OLD_NAME and NEW_NAME field and a NAME_CHANGE_EFF_DATE field to record when the change occurs. This method preserves the change. But how would you handle a second name change, or a third, and so on? The side effects of this method are increased table size and, more important, increased complexity of the queries that analyze historical values from these old fields. After more than a couple of iterations, queries become impossibly complex, and ultimately you're constrained by the maximum number of attributes allowed on a table.
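The name-change example above can be sketched as follows, again using sqlite3 for illustration and the OLD_NAME, NEW_NAME, and NAME_CHANGE_EFF_DATE fields the text describes (the surrogate key and sample values are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CUSTOMER_DIMENSION (
        CUSTOMER_KEY         INTEGER PRIMARY KEY,
        OLD_NAME             TEXT,  -- previous value, if any
        NEW_NAME             TEXT,  -- current value
        NAME_CHANGE_EFF_DATE TEXT   -- date the change took effect
    )""")
conn.execute(
    "INSERT INTO CUSTOMER_DIMENSION VALUES (1, NULL, 'Ann Jones', NULL)")

# Type Three: shift the current value into OLD_NAME, store the new value,
# and stamp the effective date. Still a single row, so only one prior
# version survives -- a second name change would overwrite OLD_NAME.
conn.execute("""
    UPDATE CUSTOMER_DIMENSION
       SET OLD_NAME = NEW_NAME,
           NEW_NAME = ?,
           NAME_CHANGE_EFF_DATE = ?
     WHERE CUSTOMER_KEY = 1""", ("Ann Smith", "1999-06-01"))

row = conn.execute(
    "SELECT OLD_NAME, NEW_NAME FROM CUSTOMER_DIMENSION").fetchone()
# row is now ('Ann Jones', 'Ann Smith')
```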
Because most business requirements include tracking changes over time, data warehouse architects commonly implement Type Two changes. A data warehouse might use Type Two changes for all attributes in all tables. As an alternative, you can implement a mix of Type One and Type Two changes at the attribute level, applying Type Two changes only to attributes whose historical values are important when you're slicing and dicing. For example, users might not need to know an individual's previous name if a name change occurs, so a Type One change would suffice. Users might want the system to show only the person's current name. However, if the company reassigns sales territories, users might need to track who sold what, at what time, and in what territory, necessitating a Type Two change.
Although most data warehouses include Type Two changes, you need to seriously examine the business need to record historical data. Implementing Type Two changes might be necessary, but those changes will increase the database size, degrade performance, and lengthen the development time. You need to carefully evaluate using a Type Two implementation, a Type One implementation, or a hybrid implementation.
Implementing the Types of Slowly Changing Dimensions
Although the three slowly changing dimension types are simple in concept, implementing them isn't trivial. The key to successfully implementing the various types of slowly changing dimensions is in the database design. The design of dimensional models accommodates changes to dimensions. Figure 1 illustrates a basic dimensional model with a single fact table and multiple dimensional tables, including a time dimension. Implementing the various change types is relatively easy in this simple design. For a Type One change, you find and update the appropriate attributes on a specific dimensional record. For example, to update a record in the SALES_PERSON_DIMENSION to show a change to an individual's SALES_PERSON_NAME field, you simply update one record in the SALES_PERSON_DIMENSION table. This action would update or correct that record for all fact records across time. In a dimensional model, facts have no meaning until you link them with their dimensions. If you change a dimensional attribute without appropriately accounting for the time dimension, the change becomes global across all fact records.
Updating the SALES_PERSON_NAME field once and seeing the change across all fact records is efficient and makes good business sense if the update corrects a misspelling, for example. But suppose a salesperson transfers to a new sales team. Updating the salesperson's dimensional record would update all previous facts so that the salesperson would appear to have always belonged to the new sales team. If you want to preserve an accurate history of who was on which sales team, a Type One change might work for the Sales Name field, but not for the Sales Team field.
To solve the sales team problem, let's look at it as a Type Three change (because of the complexity, I'll look at Type Two changes last in these examples). To implement the Type Three change, you need to alter the dimension structure slightly so it looks like Figure 2. In Figure 2, a Type Three change adds two new attributes—old and new Sales Team fields—and renames one attribute to record the date of the change. Although this method might solve the sales team problem, a Type Three implementation has three disadvantages:
- You can preserve only one change per attribute—old and new or first and last.
- Each Type Three change requires a minimum of one additional field per attribute and two additional fields if you want to record the date of the change.
- Although the dimension's structure contains all the data needed, the SQL code required to extract the information can be complex. Extracting a specific value isn't difficult, but if you want to obtain a value for a specific point in time or multiple attributes with separate old and new values, the SQL statements become long and have multiple conditions. Overall, Type Three changes can store the data of a change, but they can't accommodate multiple changes, nor can they adequately serve the need for summary reporting.
A Type Two change answers the sales team dilemma. With a Type Two change, you don't need to make structural changes to the Sales_Person_Dimension table, but you add a record. Suppose you have a record that looks like Table 1. After you have implemented the Type Two Change, two records appear, as Table 2 illustrates. Each record is related to appropriate facts, which are related to specific points in time in the Time Dimension. This capability illustrates a key benefit of the dimensional model. Although dimensional star schema models are denormalized by OLTP standards, they are partially normalized. One dimensional record may relate to thousands or even millions of fact records. Using a dimensional model drastically reduces—but doesn't eliminate—data redundancy, reduces the database size accordingly, and makes joins more efficient.
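The Type Two insert that Tables 1 and 2 illustrate can be sketched like this. The sqlite3 usage is illustrative, and the specific key values, ID, and name are assumptions, not the article's actual table data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SALES_PERSON_DIMENSION (
        SALES_PERSON_KEY  INTEGER PRIMARY KEY,  -- meaningless surrogate key
        SALES_PERSON_ID   TEXT,                 -- business key; may repeat
        SALES_PERSON_NAME TEXT,
        SALES_TEAM        TEXT
    )""")
conn.execute("INSERT INTO SALES_PERSON_DIMENSION "
             "VALUES (15, '00120', 'Doe, John B', 'Northwest')")

def apply_type_two(conn, business_id, new_team):
    """Type Two: copy the most recent row for the person, change the
    attribute, and assign a fresh surrogate key. The old row is untouched,
    so existing facts keep pointing at the old version."""
    next_key = conn.execute(
        "SELECT MAX(SALES_PERSON_KEY) + 1 FROM SALES_PERSON_DIMENSION"
    ).fetchone()[0]
    conn.execute("""
        INSERT INTO SALES_PERSON_DIMENSION
        SELECT ?, SALES_PERSON_ID, SALES_PERSON_NAME, ?
          FROM SALES_PERSON_DIMENSION
         WHERE SALES_PERSON_ID = ?
         ORDER BY SALES_PERSON_KEY DESC LIMIT 1""",
        (next_key, new_team, business_id))

apply_type_two(conn, "00120", "Southeast")
rows = conn.execute(
    "SELECT SALES_PERSON_KEY, SALES_TEAM FROM SALES_PERSON_DIMENSION "
    "ORDER BY SALES_PERSON_KEY").fetchall()
# Two rows now exist for the same SALES_PERSON_ID, one per sales team.
```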
This example illustrates two ancillary but important design features. First, the primary key for the Sales_Person_Dimension table is meaningless or generalized, and is typically just a sequential number. You need to avoid designating a primary key that has business value, such as the SALES_PERSON_ID. When you design an OLTP database, you generally don't want to use a field that has meaning as a primary key. This rule is more important in a dimensional model, and even more important when you're implementing Type Two changes. If, for example, you use the SALES_PERSON_ID as the primary key and you assume this business key will always remain unique (which is a dangerous assumption), a Type Two implementation needs to accommodate multiple records per person. In the example of the Type Two change above, the SALES_PERSON_ID never changes. The person is still the same person, but his or her information changes over time. Attempting to use the SALES_PERSON_ID field would cause a primary key integrity violation when you try to insert the second row of data for that same person.
Suppose the source system for the database uses a meaningless key that is guaranteed to never change. Consider implementing the primary key as a composite key that comprises the primary key from the source system and a sequence key identifying the number of the change record in relation to the first record for that entity. Table 3 illustrates the use of a composite key.
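A minimal sketch of the composite-key approach follows, using sqlite3 for illustration; the sequence-number logic shown (count of change records per source key) is one reasonable way to populate the second half of the key, not necessarily the article's:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SALES_PERSON_DIMENSION (
        SOURCE_KEY   TEXT,     -- key carried over from the source system
        SEQUENCE_KEY INTEGER,  -- 1 = original record, 2 = first change, ...
        SALES_TEAM   TEXT,
        PRIMARY KEY (SOURCE_KEY, SEQUENCE_KEY)
    )""")
conn.execute("INSERT INTO SALES_PERSON_DIMENSION "
             "VALUES ('00120', 1, 'Northwest')")

def next_sequence(conn, source_key):
    # The sequence part of the composite key numbers each change record
    # relative to the first record for that entity.
    return conn.execute(
        "SELECT COALESCE(MAX(SEQUENCE_KEY), 0) + 1 "
        "FROM SALES_PERSON_DIMENSION WHERE SOURCE_KEY = ?",
        (source_key,)).fetchone()[0]

# A Type Two change reuses the source key and increments the sequence.
conn.execute("INSERT INTO SALES_PERSON_DIMENSION VALUES (?, ?, ?)",
             ("00120", next_sequence(conn, "00120"), "Southeast"))
```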
Although this composite key can help you move data from the source, it increases the size of the key in the table and in its related fact tables. You need to weigh the efficiencies you'll gain from using the source system key against the related increase in key size. Again, you take a risk if you assume that the source system keys won't change or will always be unique. Assuming that the source system keys won't change forces the data warehouse to depend on the referential integrity of the source system. Any major change to the source system could significantly affect the data warehouse. If the data warehouse has a separate, meaningless key, the data warehouse will more likely be encapsulated well enough that a major source system change would affect the warehouse only internally and would not affect the interfaces to the warehouse.
The second design rule that this example illustrates is the absence of effective dates that identify exactly when the change in the SALES_PERSON_DIMENSION occurred. You don't need effective dates because the TIME_DIMENSION will effectively partition the data over time when it's related to the fact table. The effectiveness of this partitioning depends on the grain of the fact table. If a company slices the facts every day, the system will record the date of the change, but not the hour and minute of the change. The system will record the value the attribute had at the end of the time period.
Although Type Two changes add some complexity, they are manageable in the context of this simple dimensional model. You can design the simplest data marts with a single fact table and a few dimension tables. Multiple fact tables or a snowflake design can dramatically increase the complexity. However, before we look at these issues, let's address implementing slowly changing dimensions in a normalized model. The same issues that arise in multiple fact tables and snowflakes also arise in a normalized design.
Implementing the Change Types in a Normalized Model
In a normalized model (at the third normal form), fact tables and the time dimension might disappear, but the problems in implementing slowly changing dimensions are the same. You can implement any of the three change types, but their implementation becomes more difficult. At first, a normalized model might not seem worth considering in the context of a data warehouse. But normalized models have their place in data warehousing. You encounter dimensional models, especially ones with a single fact table, more often in departmental data marts because you can reduce a department's data requirements to a single set of facts. Enterprise data warehouses (which span the enterprise and have multiple subject areas and data sources), operational data stores (which offer realtime or near realtime data), and data staging areas often take on a more normalized style or hybrid implementation. The ultimate benefit of a dimensional model is the query efficiency for the end users. Enterprise data warehouses, operational data stores, and data staging areas need the capability to load in large amounts of data very fast. A relational design serves this need well.
You implement Type One and Type Three changes in a normalized model the same way you do in a dimensional model, but because a normalized model has no time dimension, implementing Type Two changes is different. Figure 3 illustrates the previous dimensional model translated into a normalized style. Instead of a time dimension, the database uses effective start and end dates to store the period for which each record is valid. The previous sales team example, before the Type Two change occurred, would look like Table 4. After you implement the Type Two change, the example would contain two records, as Table 5 illustrates.
A Type Two change would cause an insert, as in the dimensional model, but also an update in the old record. In this normalized model, the SALES_PERSON table's primary key is a composite key of a sequential number and the record's effective start date. An update of the existing record doesn't update its primary key, but by adding a value to the effective end date, the update bounds the period for which this record is valid. The new record has a new effective start date and a NULL for the effective end date, showing that the record remains current.
For querying and reporting purposes, the most current record has an indicator flag set to Y, which means that for the given entity, this record is the most current one in the table. You'll find this marker useful when you query the system if it's an operational data store or when you load data into a dimensional data mart if the table is part of an enterprise data warehouse.
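The update-then-insert sequence for a Type Two change in the normalized model can be sketched as follows. The sqlite3 usage, column names, and sample dates are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SALES_PERSON (
        SALES_PERSON_KEY INTEGER,
        EFF_START_DATE   TEXT,
        EFF_END_DATE     TEXT,  -- NULL marks an open-ended (current) record
        CURRENT_FLAG     TEXT,  -- 'Y' on the most current record
        SALES_TEAM       TEXT,
        PRIMARY KEY (SALES_PERSON_KEY, EFF_START_DATE)
    )""")
conn.execute("INSERT INTO SALES_PERSON "
             "VALUES (15, '1998-01-01', NULL, 'Y', 'Northwest')")

def type_two_change(conn, key, change_date, new_team):
    # Close out the old record: bound its validity period and clear the
    # current-record flag. The primary key of the old row is untouched.
    conn.execute("""
        UPDATE SALES_PERSON
           SET EFF_END_DATE = ?, CURRENT_FLAG = 'N'
         WHERE SALES_PERSON_KEY = ? AND EFF_END_DATE IS NULL""",
        (change_date, key))
    # Insert the new record: new effective start date, NULL end date,
    # flagged as current.
    conn.execute("INSERT INTO SALES_PERSON VALUES (?, ?, NULL, 'Y', ?)",
                 (key, change_date, new_team))

type_two_change(conn, 15, "1999-06-01", "Southeast")
current = conn.execute(
    "SELECT SALES_TEAM FROM SALES_PERSON WHERE CURRENT_FLAG = 'Y'").fetchone()
# current is ('Southeast',); the Northwest row survives with a bounded period.
```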
Detecting Changes When Loading Data
Determining which change types to implement is your first step in managing slowly changing dimensions. Detecting a change is the next step. How and whether you need to detect changes depends on the data from the source system. Describing the many variations of accessing the source data is beyond the scope of this article, but suffice it to say that the data load programs' complexity primarily depends on the format of the source data. An example of a complex process would be one in which the data warehouse receives a transactional feed from the source system or only a copy of the source system itself, meaning you have to compare snapshots of the source data. You need to compare how the incoming data currently looks to how it looked the last time the data was loaded. An example of a process that's easier to manage is one in which the extraction from the source is only a change file with an addition, update, or delete flag appended to each record.
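The snapshot-comparison case can be sketched in a few lines. This is a simplified illustration that keys each snapshot by business key in memory; the sample keys and attribute values are invented:

```python
def diff_snapshots(previous, current):
    """Derive a change file from two snapshots of the source data.
    previous/current map business key -> tuple of attribute values."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

prev = {"00120": ("Doe, John B", "Northwest"),
        "00133": ("Lee, Kim", "Central")}
curr = {"00120": ("Doe, John B", "Southeast"),  # team changed -> update
        "00140": ("Ortiz, Ana", "Northeast")}   # new rep -> insert
ins, upd, dele = diff_snapshots(prev, curr)
# ins has key '00140', upd has '00120', dele has '00133'
```

With a real source extract, the same comparison runs against the prior load's staging table rather than in-memory dictionaries, but the insert/update/delete classification is the same.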
Note that if you use a Type One and Type Two hybrid, the change you're implementing depends on the specific attributes you're changing. For example, in the Person Dimension in Figure 4, you might determine that the attribute Hobbies is worthy of only a Type One change, but an attribute such as City or State might warrant a Type Two change. An advantage of using this technique is that you can create Type Two (additional) records when you need them, which saves database space and helps query performance. A disadvantage of using this technique is that it requires an extra step to determine whether a Type Two attribute was changed. You can easily detect these changes by comparing the source attribute to the target attribute. But this comparison can become lengthy and unwieldy if you need to compare many attributes. Every attribute addition to the database would require a code change to this compare routine.
As an alternative, you can build an internal table to store the attributes that signal Type Two changes. This table requires only two fields, TABLE_NAME and ATTRIBUTE_NAME. You can then populate the table with all the attributes in the database that signal a Type Two change. The load program (bulk copy program—bcp—won't work here) then needs to query this table, find the appropriate attributes for the table it's loading, and use the result set to build a list of Type Two attributes to check for. This approach allows the designation of Type Two attributes to be dynamic, readily accommodating changes to the table structure.
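The metadata-driven compare can be sketched as follows, using sqlite3 for illustration. The TABLE_NAME and ATTRIBUTE_NAME fields come from the text; the PERSON_DIMENSION attribute names echo the Figure 4 example, and the row representation is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TYPE_TWO_ATTRIBUTES "
             "(TABLE_NAME TEXT, ATTRIBUTE_NAME TEXT)")
# Register only the attributes whose changes signal a Type Two change.
# HOBBIES is deliberately absent: a change there is handled as Type One.
conn.executemany("INSERT INTO TYPE_TWO_ATTRIBUTES VALUES (?, ?)",
                 [("PERSON_DIMENSION", "CITY"),
                  ("PERSON_DIMENSION", "STATE")])

def needs_type_two(conn, table, old_row, new_row):
    """Return True if any attribute registered as Type Two differs,
    so the load program knows to insert a new dimension record."""
    attrs = [r[0] for r in conn.execute(
        "SELECT ATTRIBUTE_NAME FROM TYPE_TWO_ATTRIBUTES "
        "WHERE TABLE_NAME = ?", (table,))]
    return any(old_row[a] != new_row[a] for a in attrs)

old       = {"CITY": "Tacoma",  "STATE": "WA", "HOBBIES": "golf"}
new_hobby = {"CITY": "Tacoma",  "STATE": "WA", "HOBBIES": "sailing"}
new_city  = {"CITY": "Olympia", "STATE": "WA", "HOBBIES": "golf"}
# A hobby change does not trigger a Type Two record; a city change does.
```

Adding a new Type Two attribute later is a one-row insert into TYPE_TWO_ATTRIBUTES rather than a code change to the compare routine.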
Cascading Type Two Changes
In a normalized model or a snowflake dimensional model, a Type Two change to a primary key of one table can—and often does—cause the need for an update to the foreign keys of records in related tables. You need to deal with this consequence programmatically with stored procedures or triggers. In SQL Server 2000, look forward to cascading declarative referential integrity (DRI) to significantly ease this burden.
Figure 5 shows a dimensional model with a snowflake design. In this design, you can store more detailed information about the sales region without adding undue overhead to every SALES_PERSON_DIMENSION record. However, the snowflake design of the SALES_PERSON_DIMENSION and REGION tables introduces further complications in implementing Type Two changes, which you need to resolve with programming logic. If you implement a Type Two change to a record in the SALES_PERSON_DIMENSION, you need to preserve only the SALES_PERSON_REGION_KEY attribute on the record. However, suppose you implement a Type Two change on the Region table. At minimum, you need to take every record in the SALES_PERSON_DIMENSION that includes the old foreign key and update it with the new foreign key for the Type Two change. And suppose a change to the SALES_PERSON_REGION_KEY is a Type Two change in the SALES_PERSON_DIMENSION table. Then for every SALES_PERSON_DIMENSION record, you need to create a Type Two change linking the record to the new Type Two change in the Region table.
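The cascading case, where a Type Two change in REGION forces a Type Two change on every related SALES_PERSON_DIMENSION record, can be sketched like this. The sqlite3 usage, key values, and surrogate-key generation are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE REGION (
        REGION_KEY  INTEGER PRIMARY KEY,
        REGION_NAME TEXT);
    CREATE TABLE SALES_PERSON_DIMENSION (
        SALES_PERSON_KEY        INTEGER PRIMARY KEY,
        SALES_PERSON_NAME       TEXT,
        SALES_PERSON_REGION_KEY INTEGER);
    INSERT INTO REGION VALUES (7, 'Northwest');
    INSERT INTO SALES_PERSON_DIMENSION VALUES (15, 'Doe, John B', 7);
""")

def cascade_region_change(conn, old_region_key, new_name):
    # Type Two on REGION: insert a new region row under a new key.
    new_region_key = conn.execute(
        "SELECT MAX(REGION_KEY) + 1 FROM REGION").fetchone()[0]
    conn.execute("INSERT INTO REGION VALUES (?, ?)",
                 (new_region_key, new_name))
    # Cascade: every salesperson record pointing at the old region key
    # gets a Type Two change of its own, linked to the new region record.
    next_sp_key = conn.execute(
        "SELECT MAX(SALES_PERSON_KEY) + 1 "
        "FROM SALES_PERSON_DIMENSION").fetchone()[0]
    for _, name in conn.execute(
            "SELECT SALES_PERSON_KEY, SALES_PERSON_NAME "
            "FROM SALES_PERSON_DIMENSION "
            "WHERE SALES_PERSON_REGION_KEY = ?",
            (old_region_key,)).fetchall():
        conn.execute("INSERT INTO SALES_PERSON_DIMENSION VALUES (?, ?, ?)",
                     (next_sp_key, name, new_region_key))
        next_sp_key += 1

cascade_region_change(conn, 7, "Pacific Northwest")
# REGION and SALES_PERSON_DIMENSION each gain one record; the new
# salesperson row points at the new region key.
```

In SQL Server, this logic would typically live in a stored procedure or trigger, as the text notes.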
Aggregations and Type One Changes
Implementing aggregate dimensions in a database can significantly improve performance for certain queries. However, the performance benefits come at a price. Because Type One changes can overwrite history, they can also change the aggregated history in any aggregated tables. Therefore, you need to implement a process to accurately and efficiently update the aggregate tables. In general, you can solve this problem in one of three ways:
Recalculate the aggregate tables. You can rebuild the entire physical aggregate dimension from scratch. This method doesn't require extra code, but the processing times might be long, depending on the data volumes.
Use virtual aggregates. With this method, you don't build an aggregate dimension. Instead, you implement the aggregated dimensions with views by using the GROUP BY parameter in your SELECT statement. This method doesn't require extra updating logic, but the query times might be long and it serves only the most basic aggregations.
Use corrected transactions. With this option, the aggregate table physically exists, and you need to write a custom program to detect changes in the detail dimension and create a transaction that updates the associated aggregate records.
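The virtual-aggregate option above can be sketched with a GROUP BY view, again using sqlite3 for illustration. The simplified fact table (team denormalized onto the fact for brevity) and all values are invented for the sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SALES_FACT (
        SALES_PERSON_KEY INTEGER,
        SALES_TEAM       TEXT,
        SALES_AMOUNT     REAL);
    INSERT INTO SALES_FACT VALUES (15, 'Northwest', 100.0);
    INSERT INTO SALES_FACT VALUES (16, 'Northwest', 250.0);
    INSERT INTO SALES_FACT VALUES (17, 'Southeast', 400.0);

    -- Virtual aggregate: no physical aggregate table to keep in sync.
    CREATE VIEW TEAM_SALES_AGG AS
        SELECT SALES_TEAM, SUM(SALES_AMOUNT) AS TOTAL_SALES
          FROM SALES_FACT
         GROUP BY SALES_TEAM;
""")

# A Type One correction to the detail...
conn.execute("UPDATE SALES_FACT SET SALES_TEAM = 'Southeast' "
             "WHERE SALES_PERSON_KEY = 16")
# ...needs no aggregate maintenance: the view recomputes at query time,
# at the cost of re-aggregating on every query.
totals = dict(conn.execute("SELECT * FROM TEAM_SALES_AGG").fetchall())
```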
Designing for Change
If you design or use a data warehouse, you can't avoid slowly changing dimensions and their considerable design and development overhead. The challenges of handling slowly changing dimensions go far beyond the examples in this article. For example, implementing slowly changing dimensions in OLAP Services presents complex challenges (for more information, see "Data Warehousing Step By Step," page 25). If you design a data warehouse with a clear understanding of the types of changes, the business need to preserve historical information, and how to properly implement history, the data warehouse will meet users' expectations well into the future.