Updating fact and dimension tables (MS SSAS)

In an ideal world, the data stored in a data warehouse would never change. Some books suggest that we should only support insert operations in a data warehouse, not updates: data comes from the OLTP system, is cleaned, and is then stored in the data warehouse until the end of time, and it should never change because it represents the situation at the time of insertion.

 

Nevertheless, the real world is somewhat different from the ideal one. While some updates are handled by the slowly changing dimension techniques already discussed, other kinds of updates are needed during the life of a data warehouse. In our experience, these updates are needed fairly regularly and come in two main kinds:

  • Structural updates: once the data warehouse is up and running, we will need to add new information such as new measures or new dimension attributes. This is normal in the lifecycle of a BI solution.

  • Data updates: we need to update data that has already been loaded into the data warehouse because it is wrong. We need to delete the old data and enter the new data, as the old data will inevitably lead to confusion. There are many reasons why bad data gets into the data warehouse; the sad reality is that bad data happens, and we need to manage it gracefully.

 

Now, how do these kinds of updates interact with fact and dimension tables? Let's briefly summarize the physical distinctions between the two:

  • Dimension tables are normally small, usually with fewer than 1 million rows and very frequently far fewer than that.

  • Fact tables are often very large; they can have up to hundreds of millions or even billions of rows. Fact tables may be partitioned, and loading data into them is usually the most time-consuming operation in the whole of the data warehouse.

Structural updates on dimension tables are very easy to make. You simply update the table with the new metadata and make the necessary changes to your ETL procedures, and the next time they run the dimension will reflect the new values. If your users decide that they want to analyze data based on a new attribute on, say, the customer dimension, then the new attribute can be added for all of the customers in the dimension. Moreover, if the attribute is not present for some customers, they can be assigned a default value; after all, updating one million rows is not a difficult task for SQL Server or any other modern relational database. However, even if updating the relational model is simple, the updates need to flow through to Analysis Services, and this might require a full process of the dimension and therefore of the cube, which can be very time-consuming.
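
For example, a minimal T-SQL sketch of this kind of change (the table, column and constraint names here are hypothetical):

    -- Add a new attribute to a hypothetical Dim_Customer table.
    -- Declaring it NOT NULL with a DEFAULT constraint backfills
    -- the existing rows with the default value in a single step.
    ALTER TABLE dbo.Dim_Customer
        ADD CustomerSegment NVARCHAR(50) NOT NULL
            CONSTRAINT DF_DimCustomer_Segment DEFAULT (N'Unknown');

The next ETL run can then overwrite the default with real values; on the Analysis Services side, the dimension still has to be processed before users can see the new attribute.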

 

On the other hand, structural updates can be a huge problem on fact tables. The problem is not altering the metadata, but determining and assigning a default value for the large number of rows already stored in the fact table. It is easy to insert data into fact tables; however, creating a new field with a default value results in an UPDATE command that will probably run for hours and might even bring down your database server. Worse, if we do not have a simple default value to assign, we will need to calculate the new value for each row in the fact table, and the update operation will take even longer. We have found that it is often better to reload the entire fact table than to perform an update on it. Of course, in order to reload the fact table we need to have all of our source data at hand, and this is not always possible.
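
To make the cost concrete, here is a hedged sketch against a hypothetical Fact_Sales table: adding the column itself is cheap, while backfilling it is the expensive part.

    -- Adding a nullable column is a quick, metadata-only change:
    ALTER TABLE dbo.Fact_Sales ADD DiscountAmount MONEY NULL;

    -- Backfilling it is what can run for hours on a table with
    -- hundreds of millions of rows, because a single statement
    -- updates every row in one huge, fully logged transaction:
    -- UPDATE dbo.Fact_Sales SET DiscountAmount = 0;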

 

Data updates are an even bigger problem still, both for fact and dimension tables. Data updates on fact tables suffer from the same problems as adding a new field: often, the number of rows we need to update is so high that even simple SQL commands can take a very long time to run.
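
One common mitigation, sketched below with a hypothetical table and key values, is to run the correction in small batches so that each transaction stays short and the transaction log stays manageable:

    -- Correct fact rows that were loaded with a wrong currency key.
    -- The WHERE clause excludes already-fixed rows, so the loop is
    -- safe to re-run if it is interrupted.
    DECLARE @wrongKey INT = 99, @correctKey INT = 42, @rows INT = 1;
    WHILE @rows > 0
    BEGIN
        UPDATE TOP (50000) dbo.Fact_Sales
        SET    CurrencyKey = @correctKey
        WHERE  DateKey BETWEEN 20240101 AND 20240131
          AND  CurrencyKey = @wrongKey;
        SET @rows = @@ROWCOUNT;
    END;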

Data updates on dimensions can be a problem because they may require very complex logic. Suppose we have a Type 2 SCD and a record was entered into the dimension table with incorrect attribute values. In this situation, we would have created a new record and linked all the facts received after its creation to the new (and incorrect) record. Recovering from this situation requires us to issue very precise UPDATE statements against the relational database and to recalculate all the fact table rows that depend, for any reason, on the incorrect record. Bad data in dimensions is not easy to spot, and sometimes several days, if not months, pass before someone (in the worst case a user) discovers that something went wrong.
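
The repair typically involves surgery along the lines of the following sketch (all table names and key values are hypothetical, and a real case would need more care with validity dates and overlapping versions):

    -- Suppose surrogate key 1002 is the incorrect version created by
    -- mistake, and 1001 is the correct version it superseded.

    -- 1. Re-point the facts that were linked to the bad record:
    UPDATE dbo.Fact_Sales
    SET    CustomerKey = 1001
    WHERE  CustomerKey = 1002;

    -- 2. Make the correct record current again, then remove the bad one:
    UPDATE dbo.Dim_Customer
    SET    ValidTo = NULL, IsCurrent = 1
    WHERE  CustomerKey = 1001;

    DELETE dbo.Dim_Customer
    WHERE  CustomerKey = 1002;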

 

There is no foolproof way to stop bad data from getting into the data warehouse. When it happens, we need to be ready to spend a long time recovering from the error. It is worth pointing out that data warehouses or data marts that are rebuilt each night ("one-shot databases") are not prone to this situation: once the bad data is corrected, the entire data warehouse can be reloaded from scratch and the problem fixed very quickly. This is one of the main advantages of "one-shot" data warehouses, although they do of course suffer from several disadvantages too, such as their limited ability to hold historic data.
