数据仓库之拉链表(二)使用拉链表更新数据

前提假设:

  • 数据仓库中订单历史表的刷新频率为一天,当天更新前一天的增量数据;

  • 如果一个订单在一天内有多次状态变化,则只会记录最后一个状态的历史;

  • 订单状态包括三个:创建、支付、完成

  • 创建时间和修改时间只取到天,如果原订单表中没有状态修改时间,那么抽取增量就比较麻烦,需要有个机制来确保能抽取到每天的增量数据;

  • 原系统中订单表结构为:
    CREATE TABLE orders (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING
    ) stored AS textfile;

  • 在数据仓库的ODS层,有一张订单的增量数据表,按天分区,存放每天的增量数据:
    CREATE TABLE t_ods_orders_inc (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING
    ) PARTITIONED BY (day STRING)
    stored AS textfile;

  • 在数据仓库的DW层,有一张订单的历史数据拉链表,存放订单的历史状态数据:
    CREATE TABLE t_dw_orders_his (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING,
    dw_start_date STRING,
    dw_end_date STRING
    ) stored AS textfile;

  • 暂未考虑Hive上表的查询性能问题,只实现功能;

每天的原系统订单表的数据如下(红色标出的为当天发生变化的订单,即增量数据):
2015-08-21订单表数据:

订单编号订单创建时间订单修改时间订单状态
orderidcreatetimemodifiedtimestatus
12015-08-182015-08-18创建
22015-08-182015-08-18创建
32015-08-192015-08-21支付
42015-08-192015-08-21完成
52015-08-192015-08-20支付
62015-08-202015-08-20创建
72015-08-202015-08-21支付
82015-08-212015-08-21创建

2015-08-22订单表数据:

订单编号订单创建时间订单修改时间订单状态
orderidcreatetimemodifiedtimestatus
12015-08-182015-08-22支付
22015-08-182015-08-22完成
32015-08-192015-08-21支付
42015-08-192015-08-21完成
52015-08-192015-08-20支付
62015-08-202015-08-22支付
72015-08-202015-08-21支付
82015-08-212015-08-22支付
92015-08-222015-08-22创建
102015-08-222015-08-22支付

2015-08-23订单表数据:

订单编号订单创建时间订单修改时间订单状态
orderidcreatetimemodifiedtimestatus
12015-08-182015-08-23完成
22015-08-182015-08-22完成
32015-08-192015-08-23完成
42015-08-192015-08-21完成
52015-08-192015-08-23完成
62015-08-202015-08-22支付
72015-08-202015-08-21支付
82015-08-212015-08-23完成
92015-08-222015-08-22创建
102015-08-222015-08-22支付
112015-08-232015-08-23创建
122015-08-232015-08-23创建
132015-08-232015-08-23支付

全量初始化

在数据从原业务系统每天正常抽取和刷新到dw订单历史表之前,需要做一次全量的初始化,就是从原订单表中昨天以前的数据全部抽取到ods,并刷新到dw

以上面的数据为例,比如在2015-08-21这天做全量初始化,那么我需要将包括2015-08-20之前的所有的数据都抽取并刷新到dw:
第一步,抽取全量数据到ods

insert overwrite table t_ods_orders_inc partition(day=‘2015-08-20’) select orderid,createtime,modifiedtime,status from orders where createtime<=‘2015-08-20’;

第二步,从ods刷新到dw:
insert overwrite table t_dw_orders_his
select orderid,createtime,modifiedtime,status,createtime as dw_start_date,‘9999-12-31’ as dw_end_date from t_ods_orders_inc where day=‘2015-08-20’;

完成后,dw订单历史表中数据如下:

<span style="color:#000000"><code>spark-sql> select * from t_dw_orders_his;
1       2015-08-18      2015-08-18      创建    2015-08-18      9999-12-31      
2       2015-08-18      2015-08-18      创建    2015-08-18      9999-12-31
3       2015-08-19      2015-08-21      支付    2015-08-19      9999-12-31
4       2015-08-19      2015-08-21      完成    2015-08-19      9999-12-31
5       2015-08-19      2015-08-20      支付    2015-08-19      9999-12-31
6       2015-08-20      2015-08-20      创建    2015-08-20      9999-12-31
7       2015-08-20      2015-08-21      支付    2015-08-20      9999-12-31
</code></span>

增量抽取

每天,从原系统订单表中,将前一天的增量数据抽取到ods层的增量数据表中;这里的增量需要通过订单表中的创建时间和修改时间来确定:
insert overwrite table t_ods_orders_inc partition(day=’{day}’)
select orderid,createtime,modifiedtime,status from orders where createtime=’{day}’ or modifiedtime=’{day}’;

注意:在ods层按天分区的增量表,最好保留一段时间的数据,比如半年,为了防止某一天的数据有问题而回滚重做数据。

增量刷新历史数据

从2015-08-22开始,需要每天正常刷新前一天(2015-08–21)的增量数据到历史表

第一步,通过增量抽取,将2015-08-21的数据抽取到ods:

insert overwrite table t_ods_orders_inc partition(day=‘2015-08-21’)
select orderid,createtime,modifiedtime,status
from orders where createtime=‘2015-08-21’ or modifiedtime=‘2015-08-21’;

ods增量表中2015-08-21的数据如下:

<span style="color:#000000"><code>spark-sql> select * from t_ods_orders_inc where day = '2015-08-21';
3       2015-08-19      2015-08-21      支付    2015-08-21
4       2015-08-19      2015-08-21      完成    2015-08-21
7       2015-08-20      2015-08-21      支付    2015-08-21
8       2015-08-21      2015-08-21      创建    2015-08-21
</code></span>

第二步,通过dw历史数据(数据日期为2015-08-20)和ods增量数据(2015-08-21)刷新历史表:
先把数据放到一张临时表中:

<span style="color:#000000"><code>DROP TABLE IF EXISTS t_dw_orders_his_tmp;
CREATE TABLE t_dw_orders_his_tmp AS 
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date 
FROM (
    SELECT a.orderid,
    a.createtime,
    a.modifiedtime,
    a.status,
    a.dw_start_date,
    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2015-08-21' THEN '2015-08-20' ELSE a.dw_end_date END AS dw_end_date 
    FROM t_dw_orders_his a 
    left outer join (SELECT * FROM t_ods_orders_inc WHERE day = '2015-08-21') b 
    ON (a.orderid = b.orderid) 
    UNION
    SELECT orderid,
    createtime,
    modifiedtime,
    status,
    modifiedtime AS dw_start_date,
    '9999-12-31' AS dw_end_date 
    FROM t_ods_orders_inc 
    WHERE day = '2015-08-21' 
) x 
ORDER BY orderid,dw_start_date;
</code></span>

其中:
UNION ALL的两个结果集中,第一个是用历史表left outer join 日期为 {yyy-MM-dd} 的增量,能关联上的,并且dw_end_date > {yyy-MM-dd},说明状态有变化,则把原来的dw_end_date置为({yyy-MM-dd} – 1), 关联不上的,说明状态无变化,dw_end_date无变化。

第二个结果集是直接将增量数据插入历史表。

最后把临时表中数据插入历史表:
INSERT overwrite TABLE t_dw_orders_his
SELECT * FROM t_dw_orders_his_tmp;

刷新完后,历史表中数据如下:

<span style="color:#000000"><code>spark-sql> select * from t_dw_orders_his order by orderid,dw_start_date;
1       2015-08-18      2015-08-18      创建    2015-08-18      9999-12-31
2       2015-08-18      2015-08-18      创建    2015-08-18      9999-12-31
3       2015-08-19      2015-08-21      支付    2015-08-19      2015-08-20
3       2015-08-19      2015-08-21      支付    2015-08-21      9999-12-31
4       2015-08-19      2015-08-21      完成    2015-08-19      2015-08-20
4       2015-08-19      2015-08-21      完成    2015-08-21      9999-12-31
5       2015-08-19      2015-08-20      支付    2015-08-19      9999-12-31
6       2015-08-20      2015-08-20      创建    2015-08-20      9999-12-31
7       2015-08-20      2015-08-21      支付    2015-08-20      2015-08-20
7       2015-08-20      2015-08-21      支付    2015-08-21      9999-12-31
8       2015-08-21      2015-08-21      创建    2015-08-21      9999-12-31
</code></span>

由于在2015-08-21做了8月20日以前的数据全量初始化,而订单3,4,7在2015-08-21的增量数据中也存在,因此都有两条记录,但不影响后面的查询。

将2015-08-22的增量数据刷新到历史表

<span style="color:#000000"><code>INSERT overwrite TABLE t_ods_orders_inc PARTITION (day = '2015-08-22') 
SELECT orderid,createtime,modifiedtime,status 
FROM orders 
WHERE createtime = '2015-08-22' OR modifiedtime = '2015-08-22';
 
DROP TABLE IF EXISTS t_dw_orders_his_tmp;
CREATE TABLE t_dw_orders_his_tmp AS 
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date 
FROM (
    SELECT a.orderid,
    a.createtime,
    a.modifiedtime,
    a.status,
    a.dw_start_date,
    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2015-08-22' THEN '2015-08-21' ELSE a.dw_end_date END AS dw_end_date 
    FROM t_dw_orders_his a 
    left outer join (SELECT * FROM t_ods_orders_inc WHERE day = '2015-08-22') b 
    ON (a.orderid = b.orderid) 
    UNION ALL 
    SELECT orderid,
    createtime,
    modifiedtime,
    status,
    modifiedtime AS dw_start_date,
    '9999-12-31' AS dw_end_date 
    FROM t_ods_orders_inc 
    WHERE day = '2015-08-22' 
) x 
ORDER BY orderid,dw_start_date;
 
 
INSERT overwrite TABLE t_dw_orders_his 
SELECT * FROM t_dw_orders_his_tmp;
</code></span>

刷新完成后历史拉链表数据如下:

<span style="color:#000000"><code>spark-sql> select * from t_dw_orders_his order by orderid,dw_start_date;
1       2015-08-18      2015-08-18      创建    2015-08-18      2015-08-21
1       2015-08-18      2015-08-22      支付    2015-08-22      9999-12-31
2       2015-08-18      2015-08-18      创建    2015-08-18      2015-08-21
2       2015-08-18      2015-08-22      完成    2015-08-22      9999-12-31
3       2015-08-19      2015-08-21      支付    2015-08-19      2015-08-20
3       2015-08-19      2015-08-21      支付    2015-08-21      9999-12-31
4       2015-08-19      2015-08-21      完成    2015-08-19      2015-08-20
4       2015-08-19      2015-08-21      完成    2015-08-21      9999-12-31
5       2015-08-19      2015-08-20      支付    2015-08-19      9999-12-31
6       2015-08-20      2015-08-20      创建    2015-08-20      2015-08-21
6       2015-08-20      2015-08-22      支付    2015-08-22      9999-12-31
7       2015-08-20      2015-08-21      支付    2015-08-20      2015-08-20
7       2015-08-20      2015-08-21      支付    2015-08-21      9999-12-31
8       2015-08-21      2015-08-21      创建    2015-08-21      2015-08-21
8       2015-08-21      2015-08-22      支付    2015-08-22      9999-12-31
9       2015-08-22      2015-08-22      创建    2015-08-22      9999-12-31
10      2015-08-22      2015-08-22      支付    2015-08-22      9999-12-31
</code></span>

查看2015-08-21的历史快照数据:

<span style="color:#000000"><code>spark-sql> select * from t_dw_orders_his where dw_start_date <= '2015-08-21' and dw_end_date >= '2015-08-21';
1       2015-08-18      2015-08-18      创建    2015-08-18      2015-08-21
2       2015-08-18      2015-08-18      创建    2015-08-18      2015-08-21
3       2015-08-19      2015-08-21      支付    2015-08-21      9999-12-31
4       2015-08-19      2015-08-21      完成    2015-08-21      9999-12-31
5       2015-08-19      2015-08-20      支付    2015-08-19      9999-12-31
6       2015-08-20      2015-08-20      创建    2015-08-20      2015-08-21
7       2015-08-20      2015-08-21      支付    2015-08-21      9999-12-31
8       2015-08-21      2015-08-21      创建    2015-08-21      2015-08-21
</code></span>

订单1在2015-08-21的时候还处于创建的状态,在2015-08-22的时候状态变为支付。

再刷新2015-08-23的增量数据:

<span style="color:#000000"><code>spark-sql> select * from t_dw_orders_his order by orderid,dw_start_date;
1       2015-08-18      2015-08-18      创建    2015-08-18      2015-08-21
1       2015-08-18      2015-08-22      支付    2015-08-22      2015-08-22
1       2015-08-18      2015-08-23      完成    2015-08-23      9999-12-31
2       2015-08-18      2015-08-18      创建    2015-08-18      2015-08-21
2       2015-08-18      2015-08-22      完成    2015-08-22      9999-12-31
3       2015-08-19      2015-08-21      支付    2015-08-19      2015-08-20
3       2015-08-19      2015-08-21      支付    2015-08-21      2015-08-22
3       2015-08-19      2015-08-23      完成    2015-08-23      9999-12-31
4       2015-08-19      2015-08-21      完成    2015-08-19      2015-08-20
4       2015-08-19      2015-08-21      完成    2015-08-21      9999-12-31
5       2015-08-19      2015-08-20      支付    2015-08-19      2015-08-22
5       2015-08-19      2015-08-23      完成    2015-08-23      9999-12-31
6       2015-08-20      2015-08-20      创建    2015-08-20      2015-08-21
6       2015-08-20      2015-08-22      支付    2015-08-22      9999-12-31
7       2015-08-20      2015-08-21      支付    2015-08-20      2015-08-20
7       2015-08-20      2015-08-21      支付    2015-08-21      9999-12-31
8       2015-08-21      2015-08-21      创建    2015-08-21      2015-08-21
8       2015-08-21      2015-08-22      支付    2015-08-22      2015-08-22
8       2015-08-21      2015-08-23      完成    2015-08-23      9999-12-31
9       2015-08-22      2015-08-22      创建    2015-08-22      9999-12-31
10      2015-08-22      2015-08-22      支付    2015-08-22      9999-12-31
11      2015-08-23      2015-08-23      创建    2015-08-23      9999-12-31
12      2015-08-23      2015-08-23      创建    2015-08-23      9999-12-31
13      2015-08-23      2015-08-23      支付    2015-08-23      9999-12-31
</code></span>

订单1从20号-23号,状态变化了三次,历史表中有三条记录

<span style="color:#000000"><code>//查看2015-08-22当天的历史快照,可以看出,和上面图中2015-08-22时候订单表中的数据是一样的
spark-sql> select * from t_dw_orders_his where dw_start_date <= '2015-08-22' and dw_end_date >= '2015-08-22';
1       2015-08-18      2015-08-22      支付    2015-08-22      2015-08-22
2       2015-08-18      2015-08-22      完成    2015-08-22      9999-12-31
3       2015-08-19      2015-08-21      支付    2015-08-21      2015-08-22
4       2015-08-19      2015-08-21      完成    2015-08-21      9999-12-31
5       2015-08-19      2015-08-20      支付    2015-08-19      2015-08-22
6       2015-08-20      2015-08-22      支付    2015-08-22      9999-12-31
7       2015-08-20      2015-08-21      支付    2015-08-21      9999-12-31
8       2015-08-21      2015-08-22      支付    2015-08-22      2015-08-22
9       2015-08-22      2015-08-22      创建    2015-08-22      9999-12-31
10      2015-08-22      2015-08-22      支付    2015-08-22      9999-12-31
Time taken: 0.328 seconds, Fetched 10 row(s)
//查看当前所有订单的最新状态
spark-sql> select * from t_dw_orders_his where dw_end_date = '9999-12-31';                                   
1       2015-08-18      2015-08-23      完成    2015-08-23      9999-12-31
2       2015-08-18      2015-08-22      完成    2015-08-22      9999-12-31
3       2015-08-19      2015-08-23      完成    2015-08-23      9999-12-31
4       2015-08-19      2015-08-21      完成    2015-08-21      9999-12-31
5       2015-08-19      2015-08-23      完成    2015-08-23      9999-12-31
6       2015-08-20      2015-08-22      支付    2015-08-22      9999-12-31
7       2015-08-20      2015-08-21      支付    2015-08-21      9999-12-31
8       2015-08-21      2015-08-23      完成    2015-08-23      9999-12-31
9       2015-08-22      2015-08-22      创建    2015-08-22      9999-12-31
10      2015-08-22      2015-08-22      支付    2015-08-22      9999-12-31
11      2015-08-23      2015-08-23      创建    2015-08-23      9999-12-31
12      2015-08-23      2015-08-23      创建    2015-08-23      9999-12-31
13      2015-08-23      2015-08-23      支付    2015-08-23      9999-12-31</code></span>
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值