Updating fact and dimension tables (MS SSAS)

In an ideal world, the data stored in a data warehouse would never change. Some books suggest that we should only support insert operations in a data warehouse, not updates: data comes from the OLTP system, is cleaned, and is then stored in the data warehouse until the end of time, and it should never change because it represents the situation at the time of insertion.

 

Nevertheless, the real world is somewhat different from the ideal one. While some updates are handled by the slowly changing dimension techniques already discussed, other kinds of updates are needed during the life of a data warehouse. In our experience, these updates are needed fairly regularly and come in two main kinds:

  • Structural updates: once the data warehouse is up and running, we will need to add new information such as new measures or new dimension attributes. This is normal in the lifecycle of a BI solution.

  • Data updates: we need to update data that has already been loaded into the data warehouse because it is wrong. We need to delete the old data and enter the new data, as the old data will inevitably lead to confusion. There are many reasons why bad data gets into the data warehouse; the sad reality is that bad data happens, and we need to manage it gracefully.

 

Now, how do these kinds of updates interact with fact and dimension tables? Let's briefly summarize the physical distinctions between the two:

  • Dimension tables are normally small, usually with fewer than 1 million rows and very frequently far fewer than that.

  • Fact tables are often very large; they can have up to hundreds of millions or even billions of rows. Fact tables may be partitioned, and loading data into them is usually the most time-consuming operation in the whole of the data warehouse.

Structural updates on dimension tables are very easy to make. You simply update the table with the new metadata and make the necessary changes to your ETL procedures, and the next time they run the dimension will reflect the new values. If your users decide that they want to analyze data based on a new attribute on, say, the customer dimension, then the new attribute can be added for all of the customers in the dimension. Moreover, if the attribute is not present for some customers, they can be assigned a default value; after all, updating one million rows is not a difficult task for SQL Server or any other modern relational database. However, even if updating the relational model is simple, the updates need to flow through to Analysis Services, and this might require a full process of the dimension and therefore of the cube, which can be very time-consuming.
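
For example, a minimal T-SQL sketch of this kind of change (the table, column and constraint names here are hypothetical):

    -- Add a new attribute to a hypothetical Dim_Customer table.
    -- Declaring it NOT NULL with a DEFAULT constraint backfills
    -- the existing rows with the default value in a single step.
    ALTER TABLE dbo.Dim_Customer
        ADD CustomerSegment NVARCHAR(50) NOT NULL
            CONSTRAINT DF_DimCustomer_Segment DEFAULT (N'Unknown');

The next ETL run can then overwrite the default with real values; on the Analysis Services side, the dimension still has to be processed before users can see the new attribute.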

 

On the other hand, structural updates can be a huge problem on fact tables. The problem is not altering the metadata, but determining and assigning a default value for the large number of rows already stored in the fact table. It is easy to insert data into fact tables; however, creating a new field with a default value results in an UPDATE command that will probably run for hours and might even bring down your database server. Worse, if we do not have a simple default value to assign, we will need to calculate the new value for each row in the fact table, and the update operation will take even longer. We have found that it is often better to reload the entire fact table than to perform an update on it. Of course, in order to reload the fact table we need to have all of our source data at hand, and this is not always possible.
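
To make the cost concrete, here is a hedged sketch against a hypothetical Fact_Sales table: adding the column itself is cheap, while backfilling it is the expensive part.

    -- Adding a nullable column is a quick, metadata-only change:
    ALTER TABLE dbo.Fact_Sales ADD DiscountAmount MONEY NULL;

    -- Backfilling it is what can run for hours on a table with
    -- hundreds of millions of rows, because a single statement
    -- updates every row in one huge, fully logged transaction:
    -- UPDATE dbo.Fact_Sales SET DiscountAmount = 0;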

 

Data updates are an even bigger problem still, both for fact and dimension tables. Data updates on fact tables suffer from the same problems as adding a new field: often, the number of rows we need to update is so high that even simple SQL commands can take a very long time to run.
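
One common mitigation, sketched below with a hypothetical table and key values, is to run the correction in small batches so that each transaction stays short and the transaction log stays manageable:

    -- Correct fact rows that were loaded with a wrong currency key.
    -- The WHERE clause excludes already-fixed rows, so the loop is
    -- safe to re-run if it is interrupted.
    DECLARE @wrongKey INT = 99, @correctKey INT = 42, @rows INT = 1;
    WHILE @rows > 0
    BEGIN
        UPDATE TOP (50000) dbo.Fact_Sales
        SET    CurrencyKey = @correctKey
        WHERE  DateKey BETWEEN 20240101 AND 20240131
          AND  CurrencyKey = @wrongKey;
        SET @rows = @@ROWCOUNT;
    END;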

Data updates on dimensions can be a problem because they may require very complex logic. Suppose we have a Type 2 SCD and a record was entered into the dimension table with incorrect attribute values. In this situation, we would have created a new record and linked all the facts received after its creation to the new (and incorrect) record. Recovering from this situation requires us to issue very precise UPDATE statements against the relational database and to recalculate all the fact table rows that depend, for any reason, on the incorrect record. Bad data in dimensions is not easy to spot, and sometimes several days, if not months, pass before someone (in the worst case a user) discovers that something went wrong.
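
The repair typically involves surgery along the lines of the following sketch (all table names and key values are hypothetical, and a real case would need more care with validity dates and overlapping versions):

    -- Suppose surrogate key 1002 is the incorrect version created by
    -- mistake, and 1001 is the correct version it superseded.

    -- 1. Re-point the facts that were linked to the bad record:
    UPDATE dbo.Fact_Sales
    SET    CustomerKey = 1001
    WHERE  CustomerKey = 1002;

    -- 2. Make the correct record current again, then remove the bad one:
    UPDATE dbo.Dim_Customer
    SET    ValidTo = NULL, IsCurrent = 1
    WHERE  CustomerKey = 1001;

    DELETE dbo.Dim_Customer
    WHERE  CustomerKey = 1002;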

 

There is no foolproof way to stop bad data from getting into the data warehouse. When it happens, we need to be ready to spend a long time recovering from the error. It is worth pointing out that data warehouses or data marts that are rebuilt each night ("one-shot databases") are not prone to this situation: once the bad data is corrected, the entire data warehouse can be reloaded from scratch and the problem fixed very quickly. This is one of the main advantages of "one-shot" data warehouses, although they do of course suffer from several disadvantages too, such as their limited ability to hold historic data.
