hive 拉链表_hive拉链表的优缺点-CSDN博客

本文链接：https://blog.csdn.net/Mogeko1/article/details/129089477

文章介绍了如何针对90%数据不常变的订单表，采用拉链表模型来节省存储空间，同时保持每天数据状态。相比每天保存全量表的方案，拉链表在查询时稍显复杂，但能有效减少冗余。示例展示了从原始数据到拉链表的转换过程，以及在Hive中的实现和查询方法。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

拉链表的概念

以订单表为例，表中90%的数据基本不会随着时间而变化，只有最近一段时间内的数据会有变化
对于这种类型的表，我们往往需要保存好每一条数据的每一天的状态

方案一：可以每天保存一份全量表，并长期存储，这样可以实现每天状态的保存，
也方便查询任何一天中数据的状态；弊端：由于表中90%的数据都不会变化，因此，
各天的全量表，其实大量数据都是相同的，存储冗余度太高
方案二：使用拉链表模型，来实现每条数据每天状态的变化情况。
优点：既能保留每天状态，又比较节省存储空间；弊端：使用、查询的时候，略增加了一点复杂性

实例：

2.8日原始数据

linux>vi zip.txt
10001,zhangsan,888,未支付,2023-02-01,2023-02-02
10001,zhangsan,888,已支付,2023-02-03,2023-02-03
10001,zhangsan,888,已发货,2023-02-04,2023-02-07
10001,zhangsan,888,已完成,2023-02-07,9999-12-31
10002,lisi,800,未支付,2023-02-06,2023-02-07
10002,lisi,800,已支付,2023-02-08,9999-12-31

2.9日新增数据

linux>vi incr.txt
10002,lisi,800,已发货
10003,wangwu,888,未支付

变化过程

10001,zhangsan,888,未支付,2023-02-01,9999-12-31	#第一次生成数据，2.1号状态‘未支付’,结束时间为9999-12-31

10001,zhangsan,888,未支付,2023-02-01,2023-02-02	#第二次用户支付 状态发生变化，第一次结束时间闭合，支付时间为2.2号
10001,zhangsan,888,已支付,2023-02-03,9999-12-31	#第二次状态变化，状态已支付完成，结束时间为9999-12-31

hive建表导入数据

hive>create table zip1(
id int,
name string,
money string,
OrderStatus string,
StartTime string,
EndTime string
)
partitioned by (dt string)
row format delimited fields terminated by ',';
load data local inpath '/root/tmp_data/test/zip.txt' into table zip1 partition(dt='2023-02-08');



hive>create table incr1(
id int,
name string,
money string,
OrderStatus string
)
partitioned by (dt string)
row format delimited fields terminated by ',';
load data local inpath '/root/tmp_data/test/incr.txt' into table incr1 partition(dt='2023-02-09');

hive>create table MergeZipper(
id int,
name string,
money string,
OrderStatus string,
StartTime string,
EndTime string
)
partitioned by (dt string)
row format delimited fields terminated by ',';

计算拉链并写入到MergeZipper表中

hive>with zips as(
select
id,
name,
money,
OrderStatus,
StartTime,
EndTime
from zip1 where dt='2023-02-08'
),
incrs as (
select
id,
name,
money,
OrderStatus
from incr1 where dt='2023-02-09'
)
INSERT into table MergeZipper   partition(dt='2023-02-09')
select
z.id,
z.name,
z.money,
z.OrderStatus,
z.StartTime,
if( z.EndTime='9999-12-31' and i.id is not null,'2023-02-08',z.EndTime) as EndTime
from zips z left join incrs i
on (z.id=i.id)
union all
select
i.id,i.name,i.money,i.OrderStatus,'2023-02-09' as StartTime,'9999-12-31' as EndTime
from incrs i;

或者

hive>with zips as(
select
id,
name,
money,
OrderStatus,
StartTime,
EndTime
from zip1 where dt='2023-02-08'
),
incrs as (
select
id,
name,
money,
OrderStatus
from incr1 where dt='2023-02-09'
)
INSERT overwrite table MergeZipper   partition(dt='2023-02-09')
select
z.id,
z.name,
z.money,
z.OrderStatus,
z.StartTime,
if( z.EndTime='9999-12-31' and i.id is not null,'2023-02-08',z.EndTime) as EndTime
from zips z left join incrs i
on (z.id=i.id)
union all
select
i.id,i.name,i.money,i.OrderStatus,'2023-02-09' as StartTime,'9999-12-31' as EndTime
from incrs i;

查询拉链表数据

hive>select * from MergeZipper where  StartTime<='2023-02-08' and EndTime>='2023-02-08';

+-----------------+-------------------+--------------------+--------------------------+------------------------+----------------------+-----------------+-
-+| mergezipper.id  | mergezipper.name  | mergezipper.money  | mergezipper.orderstatus  | mergezipper.starttime  | mergezipper.endtime  | mergezipper.dt  |
+-----------------+-------------------+--------------------+--------------------------+------------------------+----------------------+-----------------+-
-+| 10001           | zhangsan          | 888                | 已完成                      | 2023-02-07             | 9999-12-31           | 2023-02-09     
 || 10002           | lisi              | 800                | 已支付                      | 2023-02-08             | 2023-02-08           | 2023-02-09     
 || 10001           | zhangsan          | 888                | 已完成                      | 2023-02-07             | 9999-12-31           | 2023-02-10     
 || 10002           | lisi              | 800                | 已支付                      | 2023-02-08             | 2023-02-08           | 2023-02-10     
 |+-----------------+-------------------+--------------------+--------------------------+------------------------+----------------------+-----------------+-
-+4 rows selected (0.163 seconds)