hive拉链表实现的四种方式

zuhge

已于 2024-03-13 17:40:23 修改

阅读量629

点赞数 6

文章标签： hive hadoop 数据仓库

于 2024-03-06 17:41:17 首次发布

本文链接：https://blog.csdn.net/zuhge/article/details/136505428

版权

本文详细介绍了在Hive中通过日切片、增加数据生命周期、时间戳方式以及结合两者的优势实现拉链表的方法。特别强调了不同方法的适用场景、优点和潜在问题，如数据补批的复杂性及开发与分析人员的不同需求。

摘要由CSDN通过智能技术生成

hive拉链表实现的四种方式

废话不多说，直接上干货，
一般人会使用的拉链逻辑：
1.日切片方式： 每日一个切片表，存储要求非常大，不建议使用。
2.增加数据生命周期方式： 一般都会采用这种方式，增加生命周期的链路时间h_start_dt, h_end_dt，如下图：
在这里插入图片描述
如果需要查询某一天的数据：where h_start_dt<=20240306 and h_end_dt>20240306 ,使用起来非常方便，如果数据量比较大，h_start_dt, h_end_dt还可以作为分区字段，查询速度也会很快，但是，但是，但是…如果批次异常或者重跑，当日批次的还好，删除当日链路数据-回滚h_end_dt最大时间为29991231，一旦需要补非当时日期的数据，那就痛苦了，比如正常批次日期为20240306，需要修正20240101的数据，那就懵逼了，需要拆链-修数-剔数-合前链-合后链逻辑巨巨复杂，如果从业务上限制这种补批，这种方案还是很优秀的
3.增加时间戳方式 增加一个字段h_update_dt，作为数据变更时间戳，没次数据有变动则插入一条数据，如下图：

在这里插入图片描述
这种方式比第2种方式更简单，不需要关注开链闭链问题，数据补批也简单，如果需要查询20230108 的数据：

select t1.* from t20240306 t1
inner join(select id,max(h_update_dt) as h_update_dt from t20240306 where h_update_dt<=20230108 group by id) t2 
on t1.id=t2.id and t1.h_update_dt=t2.h_update_dt

在这里插入图片描述
这种方法对开发人员来说是友好的，但是对分析人员来说，使用起来就比较麻烦点，需要两两关联才能得到想要结果。

4.以方案3为数据存储，方案2为对外查询 这种方案针对存储要求不是很苛刻的应用场景，可以兼顾方案2和方案3的优点.创建一个增加时间戳的表ods.t20240312_ht，用于记录所有变更的数据，以这个为主表，每天全量更新增加生命周期的拉链表ods.t20240312_hs

------------------创建表和插入测试数据
create table ods.t20240312(id int,name string,mark string,l_batch_date string);
create table ods.t20240312_ht(id int,name string,mark string,l_is_del string,l_batch_date string);
create table ods.t20240312_hs(id int,name string,mark string,l_is_del string,h_start_dt date,h_end_dt date);

insert into ods.t20240312 values(1,'张三1','test1','20240101'),(2,'张三2','test2','20240102'),(3,'张三3','test3','20240103'),(4,'张三4','test4','20240104'),(5,'张三5','test5','20240105'),(6,'张三6','test6','20240106');
insert overwrite table ods.t20240312 values(1,'张三1','test1 01','20240108'),(7,'张三7','test7','20240110'),(4,'张三4','test4','20240104'),(5,'张三5','test5','20240105'),(6,'张三6','test6','20240106');

--------------------------执行ods层执行包含时间戳表的文件sql
---获取前一天数据
drop table if exists ods.t_t20240312_ht_01;

create table ods.t_t20240312_ht_01 as
select t1.* from ods.t20240312_ht t1
inner join(select id,max(l_batch_date) as l_batch_date from ods.t20240312_ht where l_batch_date<'20240109' group by id) t2 
on t1.id=t2.id and t1.l_batch_date=t2.l_batch_date and t1.l_is_del<>'1';

---获取最新的拉链表
drop table if exists ods.t_t20240312_ht_02;
create table ods.t_t20240312_ht_02 as 
select t1.* from ods.t20240312_ht t1 where t1.l_batch_date<>'20240109'
union all
select nvl(t1.id,t2.id),nvl(t1.name,t2.name),nvl(t1.mark,t2.mark),case when t1.id is null then '1' else '0' end as l_is_del,'20240109' as l_batch_date from ods.t20240312 t1
full join ods.t_t20240312_ht_01 t2 on t1.id=t2.id and t1.l_batch_date<='20240109' 
where  t2.id is null 
or t1.id is null
or t1.name<>t2.name
or t1.mark<>t2.mark
;

---替换更新拉链表
insert overwrite table ods.t20240312_ht
select *from ods.t_t20240312_ht_02;

--------------------------执行ods层执行包含生命周期表的文件sql
insert overwrite table ods.t20240312_hs
select 
b.id,b.name,b.mark,b.l_is_del 
, to_date(from_unixtime(unix_timestamp(cast(b.l_batch_date as string), 'yyyyMMdd')))   as h_start_dt
,case when b.l_is_del='1' then to_date(from_unixtime(unix_timestamp(cast(b.l_batch_date as string), 'yyyyMMdd'))) when b.lead_end_dt is null then to_date(from_unixtime(unix_timestamp('29991231', 'yyyyMMdd'))) else to_date(from_unixtime(unix_timestamp(cast(b.lead_end_dt as string), 'yyyyMMdd'))) end as h_end_dt
 from(
 SELECT a.*, LAG(l_batch_date) OVER (PARTITION BY id ORDER BY l_batch_date ASC) AS lag_start_dt,
 lead(l_batch_date) OVER (PARTITION BY id ORDER BY l_batch_date ASC) AS lead_end_dt
FROM ods.t20240312_ht a
) b;

zuhge

关注

6
点赞
踩
11

收藏

觉得还不错? 一键收藏
0
评论
hive拉链表实现的四种方式

如果批次异常或者重跑，当日批次的还好，删除当日链路数据-回滚h_end_dt最大时间为29991231，一旦需要补非当时日期的数据，那就痛苦了，比如正常批次日期为20240306，需要修正20240101的数据，那就懵逼了，需要拆链-修数-剔数-合前链-合后链逻辑巨巨复杂，如果从业务上限制这种补批，这种方案还是很优秀的。这种方法对开发人员来说是友好的，但是对分析人员来说，使用起来就比较麻烦点，需要两两关联才能得到想要结果。每日一个切片表，存储要求非常大，不建议使用。废话不多说，直接上干货，
复制链接

扫一扫