从快照表到拉链表的构建方式(hive)
注意:本文主要讲解在没有新增标识(最后修改时间)的前提下,通过每个字段的比对变更将存量+后续增量的快照表
转换成对应的拉链表
。
初始数据和存量数据如下:
快照表->拉链表初始化!!!!!!!!!!!!!
NOTE:这次实践主要是针对hive中的拉链表,通过快照表将hive中得到快照维度表一次性初始化转换成拉链表
AUTHOR:SHUFANGGEGE
start:原始的快照表数据
id name inc_day
A a1 20220101
A a1 20220102
A a1 20220104
A a2 20220105
A a3 20220106
A a3 20220107
A a2 20220108
A a1 20220109
B b1 20220102
A b2 20220106
end:目标结果的快照数据
id name dw_start_date dw_end_date
A a1 20220101 20220105
A a2 20220105 20220106
A a3 20220106 20220108
A a2 20220108 20220109
A a1 20220109 99991231
B b1 20220102 20220106
B b2 20220106 99991231
过程分2步骤:
- 从存量快照初始化拉链 (1)
- 增量合并拉链 (2)
1 从存量快照初始化拉链
1.1 具体思路与数据变化过程
1.2 初始化代码
--创建模拟的快照表
DROP TABLE test_zip;
CREATE TABLE test_zip stored as parquet as
select 'A' as id ,'a1' as name, '20220101' as inc_day
union all select 'A' as id , 'a1' as name , '20220102' as inc_day
union all select 'B' as id , 'b1' as name , '20220102' as inc_day
union all select 'A' as id , 'a1' as name , '20220104' as inc_day
union all select 'A' as id , 'a2' as name , '20220105' as inc_day
union all select 'B' as id , 'b2' as name , '20220106' as inc_day
union all select 'A' as id , 'a3' as name , '20220106' as inc_day
union all select 'A' as id , 'a3' as name , '20220107' as inc_day
union all select 'A' as id , 'a2' as name , '20220108' as inc_day
union all select 'A' as id , 'a1' as name , '20220109' as inc_day
--通过快照表初始化我们需要的拉链
SELECT
id,
max(name) as name,
max(dw_start_date) as dw_start_date,
max(dw_end_date) as dw_end_date
FROM(
--END_DATE
SELECT
id,
name,
NULL AS dw_start_date,
dw_end_date,
ROW_NUMBER() OVER(partition by id order by inc_day) AS rn
FROM(
SELECT
id,
name,
nvl(lead(name,1) over(partition by id order by inc_day),'NULL') as name_o,
inc_day,
nvl(lead(inc_day,1) over(partition by id order by inc_day),'99991231') as dw_end_date
FROM test_zip
) t
WHERE name_o <> name
UNION ALL
--START_DATE
SELECT
id,
name,
inc_day AS dw_start_date,
NULL AS dw_end_date,
ROW_NUMBER() OVER(partition by id order by inc_day) AS rn
FROM(
SELECT
id,
name,
nvl(LAG(name,1) over(partition by id order by inc_day),'NULL') as name_o,
inc_day
--nvl(lag(inc_day,1) over(partition by id order by inc_day),inc_day) as dw_end_date
FROM test_zip
) t1
WHERE name_o <> name
) t2
GROUP BY id,rn order by id,rn
2 增量合并拉链
在存量快照更新到拉链表中后,需要在调度上每日将新增及变化的数据与历史拉链进行合并,具体的要求如下:
#历史拉链,start到20220109
id name dw_start_date dw_end_date
A a1 20220101 20220105
A a2 20220105 20220106
A a3 20220106 20220108
A a2 20220108 20220109
A a1 20220109 99991231(最新)
B b1 20220102 20220106
B b2 20220106 99991231(最新)
#T-1每日新增及变化
A a4 20220110(变化)
B b3 20220110(变化)
C c1 20220110(新增)
#T-2存量快照(需从拉链表获取,以避免T-2有的数据在T-1的时候已经被删掉了)
A a1 20220109 99991231(最新)
B b2 20220106 99991231(最新)
NOTE:如果从快照T-2中取,加入T-1是20220110,那么T-2只能取到:A a1 20220109,而B b3 20220110会被视为新增,最终导致拉链表会重复。
B b2 20220106 99991231(最新)
#需求:合并后的结果
id name dw_start_date dw_end_date
A a1 20220101 20220105
A a2 20220105 20220106
A a3 20220106 20220108
A a2 20220108 20220109
A a1 20220109 20220110(修改)
B b1 20220102 20220106
B b2 20220106 20220110(修改)
----------------------------------
A a4 20220110 99991231(新增)
B b3 20220110 99991231(新增)
C c1 20220110 99991231(新增)
2.1 从存量快照获取到T-2的有效信息(主要是为了支持重跑)
-- 获取T-2的数据的时候,需要将时效时间为T-1的数据还原成99991231,如:20220110换成99991231
CREATE TABLE tmp2 AS
select
id,
name,
dw_start_date,
if(dw_end_date = '20220110','99991231',dw_end_date) as dw_end_date --还原当日上次因任务调度被修改的dw_end_date
from test_zip_chain where dw_start_date <= '20220109'
/***
id name dw_start_date dw_end_date id name dw_start_date dw_end_date
A a1 20220101 20220105 A a1 20220101 20220105
A a2 20220105 20220106 A a2 20220105 20220106
A a3 20220106 20220108 A a3 20220106 20220108
A a2 20220108 20220109 => A a2 20220108 20220109
A a1 20220109 20220110(被修改) A a1 20220109 99991231(还原)
B b1 20220102 20220106 B b1 20220102 20220106
B b2 20220106 20220110(被修改) B b2 20220106 99991231(还原)
*/
2.2 从T-1与T-2的拉链有效数据中
获取新增及变化
CREATE TABLE tmp1 AS
SELECT
t3.id,
t3.name,
t2.flag
FROM(
SELECT
t1.id,
if(MAX(old_name) IS NULL,'新增','修改') as flag --打上新增、修改标签
FROM(
--T-1增量数据
SELECT
name as new_name,
id,
null as old_name
FROM test_zip
WHERE a.inc_day = '20220110'
UNION ALL
--T-2存量最新拉链的99991231的有效数据,因为拉链表来源于快照表,当快照的T-1数据准备好后,再与T-2的拉链进行比对筛选出新增及变化
SELECT
null as new_name,
id,
name as old_name
FROM tmp2 where dw_end_date = '99991231' --此时
) t1
GROUP BY id
HAVING NVL(MAX(new_name),'null') <> NVL(MAX(old_name),'null') --找出差异项
) t2
JOIN test_zip t3 --关联取最新的id的全量信息
ON t2.id = t3.id
/****
整体数据变化情况如下:
a4 A null
b3 B null a4 A a1 (修改) A a4 (修改)
c1 C null => b3 B b2 (修改) => B b3 (修改)
null A a1 c1 C null(新增) C c1 (新增)
null B b2
*/
2.3 将增量与T-2的存量快照进行合并
INSERT OVERWRITE TABLE test_zip_chain
--查询新增
SELECT
id,
name,
'20220110' as dw_start_date,
'99991231' as dw_end_date
FROM tmp1
UNION ALL
--修改存量的99991231为T-1日期:20220110
SELECT
id,
name,
dw_start_date,
if(tmp2.dw_end_date = '99991231' and tmp1.id is not null ,'20220110',tmp2.dw_end_date) as dw_end_date
FROM tmp2
left join tmp1
on tmp2.id = tmp1.id
/***
A a4 (修改)
B b3 (修改)
C c1 (新增)
-------
id name dw_start_date dw_end_date
A a1 20220101 20220105
A a2 20220105 20220106
A a3 20220106 20220108
A a2 20220108 20220109
A a1 20220109 99991231(还原)
B b1 20220102 20220106
B b2 20220106 99991231(还原)
-------
合并后结果
id name dw_start_date dw_end_date
A a1 20220101 20220105
A a2 20220105 20220106
A a3 20220106 20220108
A a2 20220108 20220109
A a1 20220109 20220110 (修改存量的时效时间)
B b1 20220102 20220106
B b2 20220106 20220110 (修改存量的时效时间)
A a4 20220110 99991231 (将新增及修改的增量改成从T-1开始生效,时效为99991231)
B b3 20220110 99991231 (将新增及修改的增量改成从T-1开始生效,时效为99991231)
C c1 20220110 99991231 (将新增及修改的增量改成从T-1开始生效,时效为99991231)
*/