拉链表
1.解决的痛点
场景举例:在构建数仓时,维度表基本上都是全量(比如:省份表,商品表,活动表等)每天全量然后写入到dim层表的日期分区中(以天为单位快照一份,写入到天为单位的分区中)(这样做的目的是简单,而且后期可以查找以前的数据)数据量小,占用的空间小,但是 如果是某网站,用户量很多,用户表很大,以快照的方式存入分区中就不太合适,因为数据量大,太占用空间,拉链表能解决该问题。(总结:解决了,维度表数据量大,浪费磁盘的问题)
2.适用场景
用于处理维度数据变化的历史记录,通常用于解决“慢变化维度(Slowly Changing Dimension,SCD)”的情况,提供了一种有效管理和查询维度数据变化的方式。(总结:基础数据量大,变化量小的情况)
3.拉链表定义
拉链表,记录每条信息的生命周期,一旦一条记录的生命周期结束,就重新开始一条新的记录,并把当前日期放入生效开始日期。
如果当前信息至今有效,在生效结束日期中填入一个极大值(如9999-12-31)
用户id | 姓名 | 手机 | 开始时间 | 结束时间 |
---|---|---|---|---|
1 | 张三 | 1350987 | 2024-01-01 | 2024-01-01 |
1 | 张三 | 1350988 | 2024-01-02 | 2024-01-09 |
1 | 张三 | 1350989 | 2024-01-09 | 9999-12-31 |
4.建表
建表:业务字段+开始时间+结束时间 (开始时间和结束时间一定要有的)
drop table if exists dim_user_zip; create external table dim_user_zip( `id` STRING COMMENT '用户id', `login_name` STRING COMMENT '用户名称', `nick_name` STRING COMMENT '用户昵称', `name` STRING COMMENT '用户姓名', `phone_num` STRING COMMENT '手机号码', `email` STRING COMMENT '邮箱', `user_level` STRING COMMENT '用户等级', `birthday` STRING COMMENT '生日', `gender` STRING COMMENT '性别', `create_time` STRING COMMENT '创建时间', `operate_time` STRING COMMENT '操作时间', `start_date` STRING COMMENT '开始日期', `end_date` STRING COMMENT '结束日期' ) comment '用户表' partitioned by (dt string) stored as orc location '/warehouse/gmall/dim/dim_user_zip/' tblproperties ('orc.compress' = 'snappy');
5.首日装载
注意:1.筛选日期 (where dt = '2024-04-08')
2.筛选类型(必须是bootstrap-insert(maxwell采集))(where dt = '2024-04-08' and type='bootstrap-insert')
insert overwrite table dim_user_zip partition(dt = '9999-12-31') select data.`id`, data.`login_name`, data.`nick_name`, data.`name`, data.`phone_num`, data.`email`, data.`user_level`, data.`birthday`, data.`gender`, data.`create_time`, data.`operate_time`, '2024-04-08' start_date, '9999-12-31' end_date from ods_user_info_inc where dt = '2024-04-08' and type='bootstrap-insert'
6.每日装载
方法一:
思路:union all
1.筛选出ods中昨天更新的最后一条数据(打标记筛选)
row_nuber() over(partition by data ,id order by ts desc)
2.dim中的表(dim_user_zip)与ods层中的表(ads_user_info_inc) union all
3.再次打标记
row_nuber()over(partition by id order by start desc)
4.确定开始时间和结束时间
开始时间:start_date
结束时间:(第3步)如果标记为2,:昨天-1天,如果是1:'9999-12-31'。(if(pai_num=2,date_sub('2024-04-08',1),'9999-12-31'))
5.分区
分区:(第3步)如果标记为2,:昨天-1天分区,如果是1:'9999-12-31'分区。(if(pai_num=2,date_sub('2024-04-08',1),'9999-12-31'))
6.案例:
insert overwrite table dim_user_zip partition (dt) select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, if(pai_num=2,date_sub('2024-04-08',1),'9999-12-31'), if(pai_num=2,date_sub('2024-04-08',1),'9999-12-31') from ( select *, row_number() over (partition by id order by start_date desc ) pai_num from ( select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from dim_user_zip where dt = '9999-12-31' union all select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from ( select data.`id`, data.`login_name`, data.`nick_name`, data.`name`, data.`phone_num`, data.`email`, data.`user_level`, data.`birthday`, data.`gender`, data.`create_time`, data.`operate_time`, '2024-04-08' start_date, '9999-12-31' end_date, row_number() over (partition by data.id order by data.create_time desc) pai from ods_user_info_inc where dt = '2024-04-08' ) t1 where pai = 1 ) t2 ) t3;
方法二:
思路:full join
1.先筛选出ods中昨天更新的最后一条数据(打标记筛选)
row_nuber() over (partition by data ,id order by ts desc)
2.ods和zip表full join 得到emp表
3.判断 emp表(new_id is not null,new _id,old_id) 得到t1表(放入9999-12-31分区)
4.过滤 old_id not null and new_id is not null 得到t2表(放入到昨天-1天分区)
5.t1 union all t2 然后分区
set hive.exec.dynamic.partition.mode=nonstrict; with tmp as ( select old.id old_id, old.login_name old_login_name, old.nick_name old_nick_name, old.name old_name, old.phone_num old_phone_num, old.email old_email, old.user_level old_user_level, old.birthday old_birthday, old.gender old_gender, old.create_time old_create_time, old.operate_time old_operate_time, old.start_date old_start_date, old.end_date old_end_date, new.id new_id, new.login_name new_login_name, new.nick_name new_nick_name, new.name new_name, new.phone_num new_phone_num, new.email new_email, new.user_level new_user_level, new.birthday new_birthday, new.gender new_gender, new.create_time new_create_time, new.operate_time new_operate_time, new.start_date new_start_date, new.end_date new_end_date from ( select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from dim_user_zip where dt='9999-12-31' )old full outer join ( select id, login_name, nick_name, md5(name) name, md5(phone_num) phone_num, md5(email) email, user_level, birthday, gender, create_time, operate_time, '2022-06-09' start_date, '9999-12-31' end_date from ( select data.id, data.login_name, data.nick_name, data.name, data.phone_num, data.email, data.user_level, data.birthday, data.gender, data.create_time, data.operate_time, row_number() over (partition by data.id order by ts desc) rn from ods_user_info_inc where dt='2022-06-09' )t1 where rn=1 )new on old.id=new.id ) insert overwrite table dim_user_zip partition(dt) select if(new_id is not null,new_id,old_id), if(new_id is not null,new_login_name,old_login_name), if(new_id is not null,new_nick_name,old_nick_name), if(new_id is not null,new_name,old_name), if(new_id is not null,new_phone_num,old_phone_num), if(new_id is not null,new_email,old_email), if(new_id is not null,new_user_level,old_user_level), if(new_id is not null,new_birthday,old_birthday), if(new_id is not null,new_gender,old_gender), if(new_id is not null,new_create_time,old_create_time), if(new_id is not null,new_operate_time,old_operate_time), if(new_id is not null,new_start_date,old_start_date), if(new_id is not null,new_end_date,old_end_date), if(new_id is not null,new_end_date,old_end_date) dt from tmp union all select old_id, old_login_name, old_nick_name, old_name, old_phone_num, old_email, old_user_level, old_birthday, old_gender, old_create_time, old_operate_time, old_start_date, cast(date_add('2022-06-09',-1) as string) old_end_date, cast(date_add('2022-06-09',-1) as string) dt from tmp where old_id is not null and new_id is not null;
或者
select new.id, new.login_name, new.nick_name, new.name, new.phone_num, new.email, new.user_level, new.birthday, new.gender, new.create_time, new.operate_time, new.start_date, cast ('9999-12-31' as date) end_date from ( select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from ( select data.`id`, data.`login_name`, data.`nick_name`, data.`name`, data.`phone_num`, data.`email`, data.`user_level`, data.`birthday`, data.`gender`, data.`create_time`, data.`operate_time`, '2024-04-08' start_date, '9999-12-31' end_date, row_number() over (partition by data.id order by data.create_time desc ) pai from ods_user_info_inc where dt = '2024-04-08' ) t1 where pai = 1 ) new full join ( select * from dim_user_zip where dt = '9999-12-31' ) t3 on new.id = t3.id where new.id is not null or (t3.id is not null and new.id is null) union all ( select t3.id, t3.login_name, t3.nick_name, t3.name, t3.phone_num, t3.email, t3.user_level, t3.birthday, t3.gender, t3.create_time, t3.operate_time, t3.start_date, date_sub(t3.end_date,1) end_date from ( select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from ( select data.`id`, data.`login_name`, data.`nick_name`, data.`name`, data.`phone_num`, data.`email`, data.`user_level`, data.`birthday`, data.`gender`, data.`create_time`, data.`operate_time`, '2024-04-08' start_date, '9999-12-31' end_date, row_number() over (partition by data.id order by data.create_time desc ) pai from ods_user_info_inc where dt = '2024-04-08' ) t1 where pai = 1 ) new full join ( select * from dim_user_zip where dt = '9999-12-31' ) t3 on new.id = t3.id where t3.id is not null and new.id is not null );
7.查询
查询最新的数据
select * from dim_user_zip where dt='9999-12-31';
查询以前的数据
SELECT * FROM dim_user_zip WHERE start_date <= '查询日期' AND (end_date IS NULL OR end_date > '查询日期')