数据仓库拉链表

增量数据与历史库做成拉链表


首先介绍几个表概念:

全量表:每天的所有的最新状态的数据

增量表:每天的新增数据

拉链表:维护历史状态,以及最新状态数据

流水表:对于表中的每一个修改都会记录,可以用于反映实际记录的变更

拉链表VS流水表:

拉链表:通常是对帐户信息的历史变动进行处理保留的结果;用于统计业务相关情况
流水表:每天的交易形成的历史;用于统计账户及客户的情况

在数据仓库的数据模型设计过程中,经常会遇到这样的需求:

  1. 数据量比较大;

  2. 表中的部分字段会被update,如用户的地址,产品的描述信息,订单的状态等等;

  3. 需要查看某一个时间点或者时间段的历史快照信息,比如,查看某一个订单在历史某一个时间点的状态,
    比如,查看某一个用户在过去某一段时间内,更新过几次等等;

  4. 变化的比例和频率不是很大,比如,总共有1000万的会员,每天新增和发生变化的有10万左右;

  5. 如果对这边表每天都保留一份全量,那么每次全量中会保存很多不变的信息,对存储是极大的浪费;
    遇到些问题在没解决吗?

    拉链历史表,既能满足反应数据的历史状态,又可以最大程度的节省存储。

    拉链表案例:

    1、有一张订单表,2019-06-20 这天里面有三条订单数据:

在这里插入图片描述
到了2019-06-21这天,表中有5条记录:

在这里插入图片描述
到了2019-06-22日,表中有6条记录:

在这里插入图片描述
这种存储方式有什么问题呢?

1、只保留一份全量,则数据和6月22日的记录一样,如果需要查看6月21日订单001的状态,则无法满足;

2、每天都保留一份全量,则数据仓库中的该表共有14条记录,但好多记录都是重复保存,没有任务变化,如订单002,004,数据量大了,会造成很大的存储浪费;

如果设计成历史拉链表,如下:

在这里插入图片描述
拉链表中字段解释:

  1. dw_begin_date表示该条记录的生命周期开始时间,dw_end_date表示该条记录的生命周期结束时间;
  2. dw_end_date = '9999-12-31’表示该条记录目前处于有效状态;
  3. 如果查询当前所有有效的记录,则select * from order_his where dw_end_date = ‘9999-12-31’
  4. 如果查询2019-06-21的历史快照,则select * from order_his where dw_begin_date <= ‘2019-06-21’ and dw_end_date >= ‘2019-06-21’,这条语句会查询到以下记录:
    在这里插入图片描述
    可以看出,这样的历史拉链表,既能满足对历史数据的需求,又能很大程度的节省存储资源;

举例:

源系统中订单表结构为:

CREATE TABLE orders_20190821 (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored AS textfile;


CREATE TABLE orders_20190822 (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored AS textfile;


CREATE TABLE orders_20190823 (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored AS textfile;

插入数据:

2019-08-21订单数据表
1,2019-08-18,2019-08-18,创建
2,2019-08-18,2019-08-18,创建
3,2019-08-19,2019-08-21,支付
4,2019-08-19,2019-08-21,完成
5,2019-08-19,2019-08-20,支付
6,2019-08-20,2019-08-20,创建
7,2019-08-20,2019-08-21,支付
8,2019-08-21,2019-08-21,创建


2019-08-22订单数据表
1,2019-08-18,2019-08-22,支付
2,2019-08-18,2019-08-22,完成
3,2019-08-19,2019-08-21,支付
4,2019-08-19,2019-08-21,完成
5,2019-08-19,2019-08-20,支付
6,2019-08-20,2019-08-22,支付
7,2019-08-20,2019-08-21,支付
8,2019-08-21,2019-08-22,支付
9,2019-08-22,2019-08-22,创建
10,2019-08-22,2019-08-22,支付


2019-08-23订单数据表
1,2019-08-18,2019-08-23,完成
2,2019-08-18,2019-08-22,完成
3,2019-08-19,2019-08-23,完成
4,2019-08-19,2019-08-21,完成
5,2019-08-19,2019-08-23,完成
6,2019-08-20,2019-08-22,支付
7,2019-08-20,2019-08-21,支付
8,2019-08-21,2019-08-23,完成
9,2019-08-22,2019-08-22,创建
10,2019-08-22,2019-08-22,支付
11,2019-08-23,2019-08-23,创建
12,2019-08-23,2019-08-23,创建
13,2019-08-23,2019-08-23,支付

ODS层设计:

在数据仓库的ODS层,有一张订单的增量数据表,按天分区,存放每天的增量数据:

CREATE TABLE ods_orders_inc (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
)PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored AS textfile;

DW层设计:

在数据仓库的DW层,有一张订单的历史数据拉链表,存放订单的历史状态数据:

CREATE TABLE dw_orders_his (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING,
dw_start_date STRING,
dw_end_date STRING
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored AS textfile;

第一步,抽取全量数据到ODS:

INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-20')
SELECT orderid,createtime,modifiedtime,status
FROM orders_20190821
WHERE createtime <= '2019-08-20';

第二步,从ODS刷新到DW:

INSERT overwrite TABLE dw_orders_his
SELECT orderid,createtime,modifiedtime,status,
createtime AS dw_start_date,
'9999-12-31' AS dw_end_date
FROM ods_orders_inc
WHERE day = '2019-08-20';

完成后,DW订单历史表中数据:

+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+
| dw_orders_his.orderid  | dw_orders_his.createtime  | dw_orders_his.modifiedtime  | dw_orders_his.status  | dw_orders_his.dw_start_date  | dw_orders_his.dw_end_date  |
+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+
| 1                      | 2019-08-18                | 2019-08-18                  | 创建                    | 2019-08-18                   | 9999-12-31                 |
| 2                      | 2019-08-18                | 2019-08-18                  | 创建                    | 2019-08-18                   | 9999-12-31                 |
| 3                      | 2019-08-19                | 2019-08-21                  | 支付                    | 2019-08-19                   | 9999-12-31                 |
| 4                      | 2019-08-19                | 2019-08-21                  | 完成                    | 2019-08-19                   | 9999-12-31                 |
| 5                      | 2019-08-19                | 2019-08-20                  | 支付                    | 2019-08-19                   | 9999-12-31                 |
| 6                      | 2019-08-20                | 2019-08-20                  | 创建                    | 2019-08-20                   | 9999-12-31                 |
| 7                      | 2019-08-20                | 2019-08-21                  | 支付                    | 2019-08-20                   | 9999-12-31                 |
+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+

增量抽取

每天,从源系统订单表中,将前一天的增量数据抽取到ODS层的增量数据表。这里的增量需要通过订单表中的创建时间和修改时间来确定:

INSERT overwrite TABLE t_ods_orders_inc PARTITION (day = '${day}')
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE createtime = '${day}' OR modifiedtime = '${day}';

增量刷新历史数据

从2019-08-22开始,需要每天正常刷新前一天(2019-08-21)的增量数据到历史表。

第一步,通过增量抽取,将2019-08-21的数据抽取到ODS:

INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-21')
SELECT orderid,createtime,modifiedtime,status
FROM orders_20190821
WHERE modifiedtime = '2019-08-21' or createtime = '2019-08-21';

ODS增量表中2019-08-21的数据如下:


+-------------------------+----------------------------+------------------------------+------------------------+---------------------+--+
| ods_orders_inc.orderid  | ods_orders_inc.createtime  | ods_orders_inc.modifiedtime  | ods_orders_inc.status  | ods_orders_inc.day  |
+-------------------------+----------------------------+------------------------------+------------------------+---------------------+--+
| 3                       | 2019-08-19                 | 2019-08-21                   | 支付                     | 2019-08-21          |
| 4                       | 2019-08-19                 | 2019-08-21                   | 完成                     | 2019-08-21          |
| 7                       | 2019-08-20                 | 2019-08-21                   | 支付                     | 2019-08-21          |
| 8                       | 2019-08-21                 | 2019-08-21                   | 创建                     | 2019-08-21          |
+-------------------------+----------------------------+------------------------------+------------------------+---------------------+--+

第二步,通过DW历史数据(数据日期为2019-08-20),和ODS增量数据(2019-08-21),刷新历史表:

先把数据放到一张临时表中:

DROP TABLE IF EXISTS dw_orders_his_tmp;
CREATE TABLE dw_orders_his_tmp AS 
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date 
FROM (
    SELECT a.orderid,
    a.createtime,
    a.modifiedtime,
    a.status,
    a.dw_start_date,
    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2019-08-21' THEN '2019-08-20' ELSE a.dw_end_date END AS dw_end_date 
    FROM dw_orders_his a 
    left outer join (SELECT * FROM ods_orders_inc WHERE day = '2019-08-21') b 
    ON (a.orderid = b.orderid) 
    UNION ALL 
    SELECT orderid,
    createtime,
    modifiedtime,
    status,
    modifiedtime AS dw_start_date,
    '9999-12-31' AS dw_end_date 
    FROM ods_orders_inc 
    WHERE day = '2019-08-21' 
) x 
ORDER BY orderid,dw_start_date;

最后把临时表中数据插入历史表:

INSERT overwrite TABLE dw_orders_his
SELECT * FROM dw_orders_his_tmp;

查看拉链表数据:

+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+
| dw_orders_his.orderid  | dw_orders_his.createtime  | dw_orders_his.modifiedtime  | dw_orders_his.status  | dw_orders_his.dw_start_date  | dw_orders_his.dw_end_date  |
+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+
| 1                      | 2019-08-18                | 2019-08-18                  | 创建                    | 2019-08-18                   | 9999-12-31                 |
| 2                      | 2019-08-18                | 2019-08-18                  | 创建                    | 2019-08-18                   | 9999-12-31                 |
| 3                      | 2019-08-19                | 2019-08-21                  | 支付                    | 2019-08-19                   | 2019-08-20                 |
| 3                      | 2019-08-19                | 2019-08-21                  | 支付                    | 2019-08-21                   | 9999-12-31                 |
| 4                      | 2019-08-19                | 2019-08-21                  | 完成                    | 2019-08-19                   | 2019-08-20                 |
| 4                      | 2019-08-19                | 2019-08-21                  | 完成                    | 2019-08-21                   | 9999-12-31                 |
| 5                      | 2019-08-19                | 2019-08-20                  | 支付                    | 2019-08-19                   | 9999-12-31                 |
| 6                      | 2019-08-20                | 2019-08-20                  | 创建                    | 2019-08-20                   | 9999-12-31                 |
| 7                      | 2019-08-20                | 2019-08-21                  | 支付                    | 2019-08-20                   | 2019-08-20                 |
| 7                      | 2019-08-20                | 2019-08-21                  | 支付                    | 2019-08-21                   | 9999-12-31                 |
| 8                      | 2019-08-21                | 2019-08-21                  | 创建                    | 2019-08-21                   | 9999-12-31                 |
+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+

将2019-08-22的增量数据刷新到历史表:


第一步,通过增量抽取,将2019-08-22的数据抽取到ODS:


INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-22') 
SELECT orderid,createtime,modifiedtime,status 
FROM orders_20190822
WHERE createtime = '2019-08-22' OR modifiedtime = '2019-08-22';


第二步,通过DW历史数据(数据日期为2019-08-21),和ODS增量数据(2019-08-22)


DROP TABLE IF EXISTS dw_orders_his_tmp;
CREATE TABLE dw_orders_his_tmp AS 
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date 
FROM (
    SELECT a.orderid,
    a.createtime,
    a.modifiedtime,
    a.status,
    a.dw_start_date,
    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2019-08-22' THEN '2019-08-21' ELSE a.dw_end_date END AS dw_end_date 
    FROM dw_orders_his a 
    left outer join (SELECT * FROM ods_orders_inc WHERE day = '2019-08-22') b 
    ON (a.orderid = b.orderid) 
    UNION ALL 
    SELECT orderid,
    createtime,
    modifiedtime,
    status,
    modifiedtime AS dw_start_date,
    '9999-12-31' AS dw_end_date 
    FROM ods_orders_inc 
    WHERE day = '2019-08-22' 
) x 
ORDER BY orderid,dw_start_date;
 

第三步:最后把临时表中数据插入历史表:


INSERT overwrite TABLE dw_orders_his 
SELECT * FROM dw_orders_his_tmp;

刷新完后历史表数据如下:

hive> select * from dw_orders_his order by orderid,dw_start_date;


+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+
| dw_orders_his.orderid  | dw_orders_his.createtime  | dw_orders_his.modifiedtime  | dw_orders_his.status  | dw_orders_his.dw_start_date  | dw_orders_his.dw_end_date  |
+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+
| 1                      | 2019-08-18                | 2019-08-18                  | 创建                    | 2019-08-18                   | 2019-08-21                 |
| 1                      | 2019-08-18                | 2019-08-22                  | 支付                    | 2019-08-22                   | 9999-12-31                 |
| 2                      | 2019-08-18                | 2019-08-18                  | 创建                    | 2019-08-18                   | 2019-08-21                 |
| 2                      | 2019-08-18                | 2019-08-22                  | 完成                    | 2019-08-22                   | 9999-12-31                 |
| 3                      | 2019-08-19                | 2019-08-21                  | 支付                    | 2019-08-19                   | 2019-08-20                 |
| 3                      | 2019-08-19                | 2019-08-21                  | 支付                    | 2019-08-21                   | 9999-12-31                 |
| 4                      | 2019-08-19                | 2019-08-21                  | 完成                    | 2019-08-19                   | 2019-08-20                 |
| 4                      | 2019-08-19                | 2019-08-21                  | 完成                    | 2019-08-21                   | 9999-12-31                 |
| 5                      | 2019-08-19                | 2019-08-20                  | 支付                    | 2019-08-19                   | 9999-12-31                 |
| 6                      | 2019-08-20                | 2019-08-20                  | 创建                    | 2019-08-20                   | 2019-08-21                 |
| 6                      | 2019-08-20                | 2019-08-22                  | 支付                    | 2019-08-22                   | 9999-12-31                 |
| 7                      | 2019-08-20                | 2019-08-21                  | 支付                    | 2019-08-20                   | 2019-08-20                 |
| 7                      | 2019-08-20                | 2019-08-21                  | 支付                    | 2019-08-21                   | 9999-12-31                 |
| 8                      | 2019-08-21                | 2019-08-21                  | 创建                    | 2019-08-21                   | 2019-08-21                 |
| 8                      | 2019-08-21                | 2019-08-22                  | 支付                    | 2019-08-22                   | 9999-12-31                 |
| 9                      | 2019-08-22                | 2019-08-22                  | 创建                    | 2019-08-22                   | 9999-12-31                 |
| 10                     | 2019-08-22                | 2019-08-22                  | 支付                    | 2019-08-22                   | 9999-12-31                 |
+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+

查看2019-08-21的历史快照数据:

select * from dw_orders_his where dw_start_date <= '2019-08-21' and dw_end_date >= '2019-08-21';


+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+
| dw_orders_his.orderid  | dw_orders_his.createtime  | dw_orders_his.modifiedtime  | dw_orders_his.status  | dw_orders_his.dw_start_date  | dw_orders_his.dw_end_date  |
+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+
| 1                      | 2019-08-18                | 2019-08-18                  | 创建                    | 2019-08-18                   | 2019-08-21                 |
| 2                      | 2019-08-18                | 2019-08-18                  | 创建                    | 2019-08-18                   | 2019-08-21                 |
| 3                      | 2019-08-19                | 2019-08-21                  | 支付                    | 2019-08-21                   | 9999-12-31                 |
| 4                      | 2019-08-19                | 2019-08-21                  | 完成                    | 2019-08-21                   | 9999-12-31                 |
| 5                      | 2019-08-19                | 2019-08-20                  | 支付                    | 2019-08-19                   | 9999-12-31                 |
| 6                      | 2019-08-20                | 2019-08-20                  | 创建                    | 2019-08-20                   | 2019-08-21                 |
| 7                      | 2019-08-20                | 2019-08-21                  | 支付                    | 2019-08-21                   | 9999-12-31                 |
| 8                      | 2019-08-21                | 2019-08-21                  | 创建                    | 2019-08-21                   | 2019-08-21                 |
+------------------------+---------------------------+-----------------------------+-----------------------+------------------------------+----------------------------+--+

将2019-08-23的增量数据刷新到历史表方法跟步骤跟上述一样,不再赘述。

  • 0
    点赞
  • 8
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值