(五)进阶技术
12. 无事实的事实表
本篇讨论一种技术,用来处理源数据中没有度量的需求。例如,产品源数据不包含产品数量信息,如果系统需要得到产品的数量,很显然不能简单地从数据仓库中直接得到。这时就要用到无事实的事实表技术。使用此技术可以通过持续跟踪产品的发布来计算产品的数量。可以创建一个只有产品(计什么数)和日期(什么时候计数)维度代理键的事实表。之所以叫做无事实的事实表是因为表本身并没有度量。
产品发布的无事实事实表 本节说明如何实现一个产品发布的无事实事实表,包括新增和初始装载product_count_fact表。图(五)- 12-1显示了跟踪产品发布数量的数据仓库模式(只显示与product_count_fact表有关的表)。
执行清单(五)-12-1里的脚本创建产品发布日期视图和无事实事实表。
说明:产品发布日期视图只取产品生效日期,并不是日期维度里的所有日期。product_launch_date_dim维度表是日期维度表的子集。
下面要初始装载product_count_fact表。清单(五)-12-2从product_dim表向product_count_fact表装载已有的产品发布信息。脚本里的insert添加所有产品的第一个版本(即产品的首次发布日期)。注意做了GROUP BY的select语句返回最早插入的行,这正确地选取了产品发布时的生效日期,而不是一个SCD2行的生效日期。
图(五)- 12-2到图(五)- 12-6所示为使用Kettle初始装载产品数量事实表的步骤。
mysql> select * from product_count_fact;
+------------+------------------------+
| product_sk | product_launch_date_sk |
+------------+------------------------+
| 1 | 4809 |
| 2 | 4809 |
| 3 | 4809 |
| 5 | 5539 |
+------------+------------------------+
4 rows in set (0.00 sec)
修改DW定期装载脚本
修改了数据仓库模式后,还需要修改定期装载脚本。该脚本在导入product_dim表后还要导入product_count_fact表。清单(五)-12-3显示了修改后的定期装载脚本。实际上只是把清单(五)-12-2中的脚本加在了清单(五)- 9-2里的装载订单事实表之前。
使用Kettle作业定期装载,只对(五)进阶技术9. “杂项维度”中的定期装载做了两点修改:“清空过渡表”作业项加了清空product_count_fact表;把初始装载产品数量事实表的步骤合并到了“装载事实表(定期)”作业项里。如图(五)- 12-7、图(五)- 12-8所示。
测试定期装载
在测试修改的定期装载前,要做两点准备。
第一,修改源数据库的product表,把产品编码为1的产品名称改为‘Regular Hard Disk Drive’,并新增一个产品‘High End Hard Disk Drive’(产品编码为5)。执行清单(五)-12-4里的脚本完成此此修改。查询修改后的产品数据如下所示。
mysql> select * from source.product;
+--------------+--------------------------+------------------+
| product_code | product_name | product_category |
+--------------+--------------------------+------------------+
| 1 | Regular Hard Disk Drive | Storage |
| 2 | Floppy Drive | Storage |
| 3 | Flat Panel | Monitor |
| 4 | Keyboard | Peripheral |
| 5 | High End Hard Disk Drive | Storage |
+--------------+--------------------------+------------------+
5 rows in set (0.00 sec)
现在执行清单(五)-12-3里的脚本或对应的Kettle作业,通过查询product_count_fact表确认定期装载执行正确。查询语句和结果如下所示。
mysql> SELECT
-> c.product_sk psk,
-> c.product_code pc,
-> b.product_launch_date_sk plsk,
-> b.product_launch_date pld
-> FROM
-> product_count_fact a,
-> product_launch_date_dim b,
-> product_dim c
-> WHERE
-> a.product_launch_date_sk = b.product_launch_date_sk
-> AND a.product_sk = c.product_sk
-> ORDER BY product_code , product_launch_date;
+-----+------+------+------------+
| psk | pc | plsk | pld |
+-----+------+------+------------+
| 1 | 1 | 4809 | 2013-03-01 |
| 2 | 2 | 4809 | 2013-03-01 |
| 3 | 3 | 4809 | 2013-03-01 |
| 5 | 4 | 5539 | 2015-03-01 |
| 7 | 5 | 5557 | 2015-03-19 |
+-----+------+------+------------+
5 rows in set (0.06 sec)
12. 无事实的事实表
本篇讨论一种技术,用来处理源数据中没有度量的需求。例如,产品源数据不包含产品数量信息,如果系统需要得到产品的数量,很显然不能简单地从数据仓库中直接得到。这时就要用到无事实的事实表技术。使用此技术可以通过持续跟踪产品的发布来计算产品的数量。可以创建一个只有产品(计什么数)和日期(什么时候计数)维度代理键的事实表。之所以叫做无事实的事实表是因为表本身并没有度量。
产品发布的无事实事实表 本节说明如何实现一个产品发布的无事实事实表,包括新增和初始装载product_count_fact表。图(五)- 12-1显示了跟踪产品发布数量的数据仓库模式(只显示与product_count_fact表有关的表)。
图(五)- 12-1
执行清单(五)-12-1里的脚本创建产品发布日期视图和无事实事实表。
USE dw;
CREATE VIEW product_launch_date_dim (product_launch_date_sk , product_launch_date , month_name , month , quarter , year , promo_ind , effective_date , expiry_date) AS
SELECT DISTINCT
date_sk,
date,
month_name,
month,
quarter,
year,
promo_ind,
b.effective_date,
b.expiry_date
FROM
product_dim a,
date_dim b
WHERE
a.effective_date = b.date;
CREATE TABLE product_count_fact (
product_sk INT,
product_launch_date_sk INT,
primary key (product_sk , product_launch_date_sk),
foreign key (product_sk)
references product_dim (product_sk)
on delete cascade on update cascade,
foreign key (product_launch_date_sk)
references date_dim (date_sk)
on delete cascade on update cascade
);
清单(五)-12-1
下面要初始装载product_count_fact表。清单(五)-12-2从product_dim表向product_count_fact表装载已有的产品发布信息。脚本里的insert添加所有产品的第一个版本(即产品的首次发布日期)。注意做了GROUP BY的select语句返回最早插入的行,这正确地选取了产品发布时的生效日期,而不是一个SCD2行的生效日期。
USE dw;
TRUNCATE product_count_fact;
/* for new products */
INSERT INTO product_count_fact(product_sk, product_launch_date_sk)
SELECT
a.product_sk
, b.date_sk
FROM
product_dim a
, date_dim b
WHERE
a.effective_date = b.date
GROUP BY product_code;
COMMIT;
清单(五)-12-2
图(五)- 12-2到图(五)- 12-6所示为使用Kettle初始装载产品数量事实表的步骤。
图(五)- 12-2
图(五)- 12-3
图(五)- 12-4
图(五)- 12-5
图(五)- 12-6
注意:
- “产品维度”步骤里的查询使用了version = 1过滤条件,而没有用清单(五)-12-2里面的group by,但结果是等价的。
- “流查询”步骤的主数据流步骤为“产品维度”,查找数据流步骤为“日期维度”。查找数据流有8036条数据,主数据流有6条数据。
- 查找数据流中的数据会被一直读取,直到数据全部(8036条)读入缓存,才进行流查询。
- 流查询步骤是依据主数据流查找匹配条件的查找数据流,所以该步骤的输出记录数与主数据流记录数相同(4条)。
mysql> select * from product_count_fact;
+------------+------------------------+
| product_sk | product_launch_date_sk |
+------------+------------------------+
| 1 | 4809 |
| 2 | 4809 |
| 3 | 4809 |
| 5 | 5539 |
+------------+------------------------+
4 rows in set (0.00 sec)
修改DW定期装载脚本
修改了数据仓库模式后,还需要修改定期装载脚本。该脚本在导入product_dim表后还要导入product_count_fact表。清单(五)-12-3显示了修改后的定期装载脚本。实际上只是把清单(五)-12-2中的脚本加在了清单(五)- 9-2里的装载订单事实表之前。
USE dw;
-- 设置SCD的截止时间和生效时间
SET @pre_date = SUBDATE(CURRENT_DATE,1) ;
-- 设置CDC的上限时间
UPDATE cdc_time SET current_load = CURRENT_DATE ;
-- 装载客户维度
TRUNCATE TABLE customer_stg;
INSERT INTO customer_stg
SELECT
customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
FROM source.customer ;
/* 在所有地址列上 SCD2 */
/* 置过期 */
UPDATE customer_dim a,
customer_stg b
SET
expiry_date = @pre_date
WHERE
a.customer_number = b.customer_number
AND (a.customer_street_address <> b.customer_street_address
OR a.customer_city <> b.customer_city
OR a.customer_zip_code <> b.customer_zip_code
OR a.customer_state <> b.customer_state
OR a.shipping_address <> b.shipping_address
OR a.shipping_city <> b.shipping_city
OR a.shipping_zip_code <> b.shipping_zip_code
OR a.shipping_state <> b.shipping_state
OR a.shipping_address IS NULL
OR a.shipping_city IS NULL
OR a.shipping_zip_code IS NULL
OR a.shipping_state IS NULL)
AND expiry_date = '2200-01-01';
/* 加新行 */
INSERT INTO customer_dim
SELECT
NULL
, b.customer_number
, b.customer_name
, b.customer_street_address
, b.customer_zip_code
, b.customer_city
, b.customer_state
, b.shipping_address
, b.shipping_zip_code
, b.shipping_city
, b.shipping_state
, a.version + 1
, @pre_date
, '2200-01-01'
FROM
customer_dim a
, customer_stg b
WHERE
a.customer_number = b.customer_number
AND ( a.customer_street_address <> b.customer_street_address
OR a.customer_city <> b.customer_city
OR a.customer_zip_code <> b.customer_zip_code
OR a.customer_state <> b.customer_state
OR a.shipping_address <> b.shipping_address
OR a.shipping_city <> b.shipping_city
OR a.shipping_zip_code <> b.shipping_zip_code
OR a.shipping_state <> b.shipping_state
OR a.shipping_address IS NULL
OR a.shipping_city IS NULL
OR a.shipping_zip_code IS NULL
OR a.shipping_state IS NULL)
AND EXISTS(
SELECT *
FROM customer_dim x
WHERE
b.customer_number=x.customer_number
AND a.expiry_date = @pre_date )
AND NOT EXISTS (
SELECT *
FROM customer_dim y
WHERE
b.customer_number = y.customer_number
AND y.expiry_date = '2200-01-01') ;
/* 在 customer_name 列上 SCD1 */
UPDATE customer_dim a, customer_stg b
SET a.customer_name = b.customer_name
WHERE a.customer_number = b.customer_number
AND a.customer_name <> b.customer_name ;
/* 新增的客户 */
INSERT INTO customer_dim
SELECT
NULL
, customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
, 1
, @pre_date
,'2200-01-01'
FROM customer_stg
WHERE customer_number NOT IN(
SELECT y.customer_number
FROM customer_dim x, customer_stg y
WHERE x.customer_number = y.customer_number) ;
/* 重建PA客户维度 */
TRUNCATE pa_customer_dim;
INSERT INTO pa_customer_dim
SELECT
customer_sk
, customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
, version
, effective_date
, expiry_date
FROM customer_dim
WHERE customer_state = 'PA' ;
/* 装载产品维度 */
TRUNCATE TABLE product_stg ;
INSERT INTO product_stg
SELECT
product_code
, product_name
, product_category
FROM source.product ;
/* 在 product_name 和 product_category 列上 SCD2 */
/* 置过期 */
UPDATE
product_dim a
, product_stg b
SET
expiry_date = @pre_date
WHERE
a.product_code = b.product_code
AND ( a.product_name <> b.product_name
OR a.product_category <> b.product_category)
AND expiry_date = '2200-01-01';
/* 加新行 */
INSERT INTO product_dim
SELECT
NULL
, b.product_code
, b.product_name
, b.product_category
, a.version + 1
, @pre_date
,'2200-01-01'
FROM
product_dim a
, product_stg b
WHERE
a.product_code = b.product_code
AND ( a.product_name <> b.product_name
OR a.product_category <> b.product_category)
AND EXISTS(
SELECT *
FROM product_dim x
WHERE b.product_code = x.product_code
AND a.expiry_date = @pre_date)
AND NOT EXISTS (
SELECT *
FROM product_dim y
WHERE b.product_code = y.product_code
AND y.expiry_date = '2200-01-01') ;
/* 新增的产品 */
INSERT INTO product_dim
SELECT
NULL
, product_code
, product_name
, product_category
, 1
, @pre_date
, '2200-01-01'
FROM product_stg
WHERE product_code NOT IN(
SELECT y.product_code
FROM product_dim x, product_stg y
WHERE x.product_code = y.product_code) ;
/* PRODUCT_COUNT_FACT POPULATION */
TRUNCATE product_count_fact;
INSERT INTO product_count_fact(product_sk, product_launch_date_sk)
SELECT
a.product_sk
, b.date_sk
FROM
product_dim a
, date_dim b
WHERE
a.effective_date = b.date
GROUP BY product_code;
/* END OF PRODUCT_COUNT_FACT POPULATION */
-- 装载事实表,新增前一天的订单
INSERT INTO sales_order_fact
SELECT
customer_sk
, product_sk
, g.sales_order_attribute_sk
, e.order_date_sk
, NULL
, NULL
, NULL
, NULL
, a.order_number
, f.request_delivery_date_sk
, order_amount
, quantity
, NULL
, NULL
, NULL
, NULL
FROM
source.sales_order a
, customer_dim c
, product_dim d
, order_date_dim e
, request_delivery_date_dim f
, sales_order_attribute_dim g
, cdc_time h
WHERE
a.order_status = 'N'
AND a.customer_number = c.customer_number
AND a.status_date >= c.effective_date
AND a.status_date < c.expiry_date
AND a.product_code = d.product_code
AND a.status_date >= d.effective_date
AND a.status_date < d.expiry_date
AND a.status_date = e.order_date
AND a.request_delivery_date = f.request_delivery_date
AND a.verification_ind = g.verification_ind
AND a.credit_check_flag = g.credit_check_flag
AND a.new_customer_ind = g.new_customer_ind
AND a.web_order_flag = g.web_order_flag
AND a.entry_date >= h.last_load AND a.entry_date < h.current_load ;
/* UPDATING the new sales order to Allocated status */
UPDATE sales_order_fact a,
source.sales_order b,
allocate_date_dim c,
cdc_time h
SET
a.allocate_date_sk = c.allocate_date_sk,
a.allocate_quantity = b.quantity
WHERE
order_status = 'A'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = a.order_number
AND c.allocate_date = b.status_date ;
/* UPDATING the allocated order to Packed status */
UPDATE sales_order_fact a,
source.sales_order b,
packing_date_dim d,
cdc_time h
SET
a.packing_date_sk = d.packing_date_sk,
a.packing_quantity = b.quantity
WHERE
order_status = 'P'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = a.order_number
AND d.packing_date = b.status_date ;
/* UPDATING the packed order to Shipped status */
UPDATE sales_order_fact a,
source.sales_order b,
ship_date_dim e,
cdc_time h
SET
a.ship_date_sk = e.ship_date_sk,
a.ship_quantity = b.quantity
WHERE
order_status = 'S'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = a.order_number
AND e.ship_date = b.status_date ;
/* UPDATING the shipped order to Received status */
UPDATE sales_order_fact a,
source.sales_order b,
receive_date_dim f,
cdc_time h
SET
a.receive_date_sk = f.receive_date_sk,
a.receive_quantity = b.quantity
WHERE
order_status = 'R'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = a.order_number
AND f.receive_date = b.status_date ;
-- 更新时间戳表的last_load字段
UPDATE cdc_time SET last_load = current_load ;
COMMIT ;
清单(五)-12-3
使用Kettle作业定期装载,只对(五)进阶技术9. “杂项维度”中的定期装载做了两点修改:“清空过渡表”作业项加了清空product_count_fact表;把初始装载产品数量事实表的步骤合并到了“装载事实表(定期)”作业项里。如图(五)- 12-7、图(五)- 12-8所示。
图(五)- 12-7
图(五)- 12-8
测试定期装载
在测试修改的定期装载前,要做两点准备。
第一,修改源数据库的product表,把产品编码为1的产品名称改为‘Regular Hard Disk Drive’,并新增一个产品‘High End Hard Disk Drive’(产品编码为5)。执行清单(五)-12-4里的脚本完成此此修改。查询修改后的产品数据如下所示。
mysql> select * from source.product;
+--------------+--------------------------+------------------+
| product_code | product_name | product_category |
+--------------+--------------------------+------------------+
| 1 | Regular Hard Disk Drive | Storage |
| 2 | Floppy Drive | Storage |
| 3 | Flat Panel | Monitor |
| 4 | Keyboard | Peripheral |
| 5 | High End Hard Disk Drive | Storage |
+--------------+--------------------------+------------------+
5 rows in set (0.00 sec)
use source;
update product set product_name = 'Regular Hard Disk Drive' where product_code=1;
insert into product values (5, 'High End Hard Disk Drive', 'Storage');
commit;
清单(五)-12-4
现在执行清单(五)-12-3里的脚本或对应的Kettle作业,通过查询product_count_fact表确认定期装载执行正确。查询语句和结果如下所示。
mysql> SELECT
-> c.product_sk psk,
-> c.product_code pc,
-> b.product_launch_date_sk plsk,
-> b.product_launch_date pld
-> FROM
-> product_count_fact a,
-> product_launch_date_dim b,
-> product_dim c
-> WHERE
-> a.product_launch_date_sk = b.product_launch_date_sk
-> AND a.product_sk = c.product_sk
-> ORDER BY product_code , product_launch_date;
+-----+------+------+------------+
| psk | pc | plsk | pld |
+-----+------+------+------------+
| 1 | 1 | 4809 | 2013-03-01 |
| 2 | 2 | 4809 | 2013-03-01 |
| 3 | 3 | 4809 | 2013-03-01 |
| 5 | 4 | 5539 | 2015-03-01 |
| 7 | 5 | 5557 | 2015-03-19 |
+-----+------+------+------------+
5 rows in set (0.06 sec)