拉链表的概念
以订单表为例,表中90%的数据基本不会随着时间而变化,只有最近一段时间内的数据会有变化
对于这种类型的表,我们往往需要保存好每一条数据的每一天的状态
方案一:可以每天保存一份全量表,并长期存储,这样可以实现每天状态的保存,
也方便查询任何一天中数据的状态;弊端:由于表中90%的数据都不会变化,因此,
各天的全量表,其实大量数据都是相同的,存储冗余度太高
方案二:使用拉链表模型,来实现每条数据每天状态的变化情况。
优点:既能保留每天状态,又比较节省存储空间;弊端:使用、查询的时候,略增加了一点复杂性
实例:
2.8日原始数据
linux>vi zip.txt
10001,zhangsan,888,未支付,2023-02-01,2023-02-02
10001,zhangsan,888,已支付,2023-02-03,2023-02-03
10001,zhangsan,888,已发货,2023-02-04,2023-02-07
10001,zhangsan,888,已完成,2023-02-07,9999-12-31
10002,lisi,800,未支付,2023-02-06,2023-02-07
10002,lisi,800,已支付,2023-02-08,9999-12-31
2.9日新增数据
linux>vi incr.txt
10002,lisi,800,已发货
10003,wangwu,888,未支付
变化过程
10001,zhangsan,888,未支付,2023-02-01,9999-12-31 #第一次生成数据,2.1号状态‘未支付’,结束时间为9999-12-31
10001,zhangsan,888,未支付,2023-02-01,2023-02-02 #第二次用户支付 状态发生变化,第一次结束时间闭合,支付时间为2.2号
10001,zhangsan,888,已支付,2023-02-03,9999-12-31 #第二次状态变化,状态已支付完成,结束时间为9999-12-31
hive建表导入数据
hive>create table zip1(
id int,
name string,
money string,
OrderStatus string,
StartTime string,
EndTime string
)
partitioned by (dt string)
row format delimited fields terminated by ',';
load data local inpath '/root/tmp_data/test/zip.txt' into table zip1 partition(dt='2023-02-08');
hive>create table incr1(
id int,
name string,
money string,
OrderStatus string
)
partitioned by (dt string)
row format delimited fields terminated by ',';
load data local inpath '/root/tmp_data/test/incr.txt' into table incr1 partition(dt='2023-02-09');
hive>create table MergeZipper(
id int,
name string,
money string,
OrderStatus string,
StartTime string,
EndTime string
)
partitioned by (dt string)
row format delimited fields terminated by ',';
计算拉链并写入到MergeZipper表中
hive>with zips as(
select
id,
name,
money,
OrderStatus,
StartTime,
EndTime
from zip1 where dt='2023-02-08'
),
incrs as (
select
id,
name,
money,
OrderStatus
from incr1 where dt='2023-02-09'
)
INSERT into table MergeZipper partition(dt='2023-02-09')
select
z.id,
z.name,
z.money,
z.OrderStatus,
z.StartTime,
if( z.EndTime='9999-12-31' and i.id is not null,'2023-02-08',z.EndTime) as EndTime
from zips z left join incrs i
on (z.id=i.id)
union all
select
i.id,i.name,i.money,i.OrderStatus,'2023-02-09' as StartTime,'9999-12-31' as EndTime
from incrs i;
或者
hive>with zips as(
select
id,
name,
money,
OrderStatus,
StartTime,
EndTime
from zip1 where dt='2023-02-08'
),
incrs as (
select
id,
name,
money,
OrderStatus
from incr1 where dt='2023-02-09'
)
INSERT overwrite table MergeZipper partition(dt='2023-02-09')
select
z.id,
z.name,
z.money,
z.OrderStatus,
z.StartTime,
if( z.EndTime='9999-12-31' and i.id is not null,'2023-02-08',z.EndTime) as EndTime
from zips z left join incrs i
on (z.id=i.id)
union all
select
i.id,i.name,i.money,i.OrderStatus,'2023-02-09' as StartTime,'9999-12-31' as EndTime
from incrs i;
查询拉链表数据
hive>select * from MergeZipper where StartTime<='2023-02-08' and EndTime>='2023-02-08';
+-----------------+-------------------+--------------------+--------------------------+------------------------+----------------------+-----------------+-
-+| mergezipper.id | mergezipper.name | mergezipper.money | mergezipper.orderstatus | mergezipper.starttime | mergezipper.endtime | mergezipper.dt |
+-----------------+-------------------+--------------------+--------------------------+------------------------+----------------------+-----------------+-
-+| 10001 | zhangsan | 888 | 已完成 | 2023-02-07 | 9999-12-31 | 2023-02-09
|| 10002 | lisi | 800 | 已支付 | 2023-02-08 | 2023-02-08 | 2023-02-09
|| 10001 | zhangsan | 888 | 已完成 | 2023-02-07 | 9999-12-31 | 2023-02-10
|| 10002 | lisi | 800 | 已支付 | 2023-02-08 | 2023-02-08 | 2023-02-10
|+-----------------+-------------------+--------------------+--------------------------+------------------------+----------------------+-----------------+-
-+4 rows selected (0.163 seconds)