数仓-拉链表实现

最新推荐文章于 2024-07-10 15:27:46 发布

别这么骄傲

最新推荐文章于 2024-07-10 15:27:46 发布

阅读量187

点赞数 2

分类专栏： hive 文章标签：大数据 hadoop hive

本文链接：https://blog.csdn.net/weixin_43753599/article/details/129605377

版权

hive 专栏收录该内容

8 篇文章 0 订阅

订阅专栏

1、拉链表使用场景（解决数据同步问题）

（1）变化的比例和频率不是很大

总共有1000万的用户，每天新增和发生变化的有10万左右

（2）需要查看某一个时间点或者时间段的历史快照信息

查看某一个用户在历史某一时间点的手机号码
查看某一个用户在过去某一段时间内，更新过几次等等

（3）变化的比例和频率不是很大

总共有1000万的用户，每天新增和发生变化的有10万左右

2、拉链表实现图解

（1）更新数据

在这里插入图片描述

（2）最终数据

  上图更新表
  union all 
  ods层第二天新增用户表数据

===>完整表数据

在这里插入图片描述

3、拉链表实现过程

（1）测试数据准备

！！备注：两个文件的数据需要上传到 虚拟机或者hdfs

zipper.txt 文件

001	186xxxx1234	laoda	0	sh	2021-01-01	9999-12-31
002	186xxxx1235	laoer	1	bj	2021-01-01	9999-12-31
003	186xxxx1236	laosan	0	sz	2021-01-01	9999-12-31
004	186xxxx1237	laosi	1	gz	2021-01-01	9999-12-31
005	186xxxx1238	laowu	0	sh	2021-01-01	9999-12-31
006	186xxxx1239	laoliu	1	bj	2021-01-01	9999-12-31
007	186xxxx1240	laoqi	0	sz	2021-01-01	9999-12-31
008	186xxxx1241	laoba	1	gz	2021-01-01	9999-12-31
009	186xxxx1242	laojiu	0	sh	2021-01-01	9999-12-31
010	186xxxx1243	laoshi	1	bj	2021-01-01	9999-12-31


update.txt 文件


008	186xxxx1241	laoba	1	sh	2021-01-02	9999-12-31
011	186xxxx1244	laoshi	1	jx	2021-01-02	9999-12-31
012	186xxxx1245	laoshi	0	zj	2021-01-02	9999-12-31

（2）创建表和导入数据

--把zipper.txt 、update.txt两个文件放到虚拟机的/root/hivedata目录

--Step1：创建dwd层的 拉链表
--模拟导入第一天全量数据
create table dw_zipper(
                          userid string,
                          phone string,
                          nick string,
                          gender int,
                          addr string,
                          starttime string,
                          endtime string
) row format delimited fields terminated by '\t';

--加载模拟数据
load data local inpath '/root/hivedata/zipper.txt' into table dw_zipper;
--使用put也可以
hadoop fs -put zipper.txt /user/hive/warehouse/test.db/dw_zipper

select userid,nick,addr,starttime,endtime from dw_zipper;

在这里插入图片描述

--Step2：模拟导入第二天 增量数据采集

create table ods_zipper_update(
                                  userid string,
                                  phone string,
                                  nick string,
                                  gender int,
                                  addr string,
                                  starttime string,
                                  endtime string
) row format delimited fields terminated by '\t';

load data local inpath '/root/hivedata/update.txt' into table ods_zipper_update;
--使用put也可以
hadoop fs -put update.txt /user/hive/warehouse/test.db/ods_zipper_update

select * from ods_zipper_update;

在这里插入图片描述

--Step3：创建临时表
create table tmp_zipper(
                           userid string,
                           phone string,
                           nick string,
                           gender int,
                           addr string,
                           starttime string,
                           endtime string
) row format delimited fields terminated by '\t';


--Step4：合并拉链表与增量表
insert overwrite table tmp_zipper
select
    userid,
    phone,
    nick,
    gender,
    addr,
    starttime,
    endtime
from ods_zipper_update
union all
--查询原来拉链表的所有数据，并将这次需要更新的数据的endTime更改为更新值的startTime
select
    a.userid,
    a.phone,
    a.nick,
    a.gender,
    a.addr,
    a.starttime,
    --如果这条数据没有更新或者这条数据不是要更改的数据，就保留原来的值，否则就改为新数据的开始时间-1
    if(b.userid is null or a.endtime < '9999-12-31', a.endtime , date_sub(b.starttime,1)) as endtime
from dw_zipper a  left join ods_zipper_update b
                            on a.userid = b.userid ;


--Step5：覆盖拉链表
insert overwrite table dw_zipper
select * from tmp_zipper;