hive 生成拉链表&拉链表数据查询&拉链表回滚

生成拉链表思路分析

在这里插入图片描述

ods层

show databases;
create database test_zip;
-- ods
drop table if exists test_zip.ods_user_info;
create table test_zip.ods_user_info
(
    id   string comment 'id',
    name string comment '姓名'
) comment '用户信息表'
    partitioned by (load_date string comment '分区字段');

-- 装载语句
-- 将以下数据插入 ods_user_info
insert into table test_zip.ods_user_info values ("1", "张三", "2020-06-14");
insert into table test_zip.ods_user_info values ("2", "李四", "2020-06-14");
insert into table test_zip.ods_user_info values ("3", "王五", "2020-06-14");
insert into table test_zip.ods_user_info values ("4", "赵六", "2020-06-14");
insert into table test_zip.ods_user_info values ("5", "小明", "2020-06-14");
insert into table test_zip.ods_user_info values ("6", "小华", "2020-06-14");
insert into table test_zip.ods_user_info values ("7", "小强", "2020-06-14");
-- 第二天的新增和变化数据
insert into table test_zip.ods_user_info values ("6", "大华", "2020-06-15");
insert into table test_zip.ods_user_info values ("7", "大强", "2020-06-15");
insert into table test_zip.ods_user_info values ("8", "马克", "2020-06-15");
insert into table test_zip.ods_user_info values ("9", "威廉", "2020-06-15");

这里直接给 ods 层插入了4天的数据

在这里插入图片描述

dim层 初始化拉链表

drop table if exists dim_user_info_zip;
create table if not exists dim_user_info_zip
(
    id         string comment 'id',
    name       string comment '姓名',
    start_date string comment '开始日期',
    end_date   string comment '日期'
) comment '用户拉链表' partitioned by (dt string comment '分区字段');

首日装载

-- 首日装载(全量数据载) 将 2020-06-14 作为数仓上线第一天
insert overwrite table dim_user_info_zip partition (dt)
select id,
       name,
       '2020-06-14' as start_date,
       '9999-12-31' as end_date,
       '9999-12-31' as dt
from ods_user_info
where dt = '2020-06-14';

select * from dim_user_info_zip;

在这里插入图片描述

每日装载(15号)

-- todo 2020-06-15
-- 每日装载
-- 每日装载 获取第二天新增和变化的数据
with tmp as (
            select old.id         as old_id,
                   old.name       as old_name,
                   old.start_date as old_start_date,
                   old.end_date   as old_end_date,
                   new.id         as new_id,
                   new.name       as new_name,
                   new.start_date as new_start_date,
                   new.end_date   as new_end_date
            from (
                 select id,
                        name,
                        start_date,
                        end_date
                 from dim_user_info_zip
                 where dt = '9999-12-31'
                 ) old
                     full outer join
                 ( -- 获取第二天新增和变化的数据
                 select id,
                        name,
                        '2020-06-15' as start_date,
                        '9999-12-31' as end_date
                 from ods_user_info
                 where dt = '2020-06-15'
                 ) new
                 on old.id = new.id
            )
insert overwrite table dim_user_info_zip partition (dt)
select if(new_id is not null, new_id, old_id)                 id,
       if(new_id is not null, new_name, old_name)             name,
       if(new_id is not null, new_start_date, old_start_date) start_date,
       if(new_id is not null, new_end_date, old_end_date)     end_date,
       if(new_id is not null, new_end_date, old_end_date)     dt
from tmp
union all
select old_id                                    as id,
       old_name                                  as name,
       old_start_date                            as start_date,
       cast(date_sub('2020-06-15', 1) as string) as end_date,
       cast(date_sub('2020-06-15', 1) as string) as dt
from tmp
where old_id is not null
  and new_id is not null;
select * from dim_user_info_zip;

在这里插入图片描述

select id,
       name,
       start_date,
       '9999-12-31' as end_date,
       '9999-12-31' as dt
from dim_user_info_zip
where end_date >= '2020-06-15'
  and start_date <= '2020-06-15';

-- 15号也是当天最新数据
select *
from dim_user_info_zip
where end_date='9999-12-31';

在这里插入图片描述

每日装载(16号)

-- todo 2020-06-16
-- 每日装载
-- 每日装载 获取第二天新增和变化的数据
with tmp as (
            select old.id         as old_id,
                   old.name       as old_name,
                   old.start_date as old_start_date,
                   old.end_date   as old_end_date,
                   new.id         as new_id,
                   new.name       as new_name,
                   new.start_date as new_start_date,
                   new.end_date   as new_end_date
            from (
                 select id,
                        name,
                        start_date,
                        end_date
                 from dim_user_info_zip
                 where dt = '9999-12-31'
                 ) old
                     full outer join
                 ( -- 获取第二天新增和变化的数据
                 select id,
                        name,
                        '2020-06-16' as start_date,
                        '9999-12-31' as end_date
                 from ods_user_info
                 where dt = '2020-06-16'
                 ) new
                 on old.id = new.id
            )
insert overwrite table dim_user_info_zip partition (dt)
select if(new_id is not null, new_id, old_id)                 id,
       if(new_id is not null, new_name, old_name)             name,
       if(new_id is not null, new_start_date, old_start_date) start_date,
       if(new_id is not null, new_end_date, old_end_date)     end_date,
       if(new_id is not null, new_end_date, old_end_date)     dt
from tmp
union all
select old_id                                    as id,
       old_name                                  as name,
       old_start_date                            as start_date,
       cast(date_sub('2020-06-16', 1) as string) as end_date,
       cast(date_sub('2020-06-16', 1) as string) as dt
from tmp
where old_id is not null
  and new_id is not null;
select * from dim_user_info_zip;

在这里插入图片描述

select id,
       name,
       start_date,
       '9999-12-31' as end_date,
       '9999-12-31' as dt
from dim_user_info_zip
where end_date >= '2020-06-16'
  and start_date <= '2020-06-16';

-- 16号也是当天最新数据
select *
from dim_user_info_zip
where end_date='9999-12-31';

在这里插入图片描述

每日装载(17号)

-- todo 2020-06-17
with tmp as (
            select old.id         as old_id,
                   old.name       as old_name,
                   old.start_date as old_start_date,
                   old.end_date   as old_end_date,
                   new.id         as new_id,
                   new.name       as new_name,
                   new.start_date as new_start_date,
                   new.end_date   as new_end_date
            from (
                 select id,
                        name,
                        start_date,
                        end_date
                 from dim_user_info_zip
                 where dt = '9999-12-31'
                 ) old
                     full outer join
                 ( -- 获取第二天新增和变化的数据
                 select id,
                        name,
                        '2020-06-17' as start_date,
                        '9999-12-31' as end_date
                 from ods_user_info
                 where dt = '2020-06-17'
                 ) new
                 on old.id = new.id
            )
insert overwrite table dim_user_info_zip partition (dt)
select if(new_id is not null, new_id, old_id)                 id,
       if(new_id is not null, new_name, old_name)             name,
       if(new_id is not null, new_start_date, old_start_date) start_date,
       if(new_id is not null, new_end_date, old_end_date)     end_date,
       if(new_id is not null, new_end_date, old_end_date)     dt
from tmp
union all
select old_id                                    as id,
       old_name                                  as name,
       old_start_date                            as start_date,
       cast(date_sub('2020-06-17', 1) as string) as end_date,
       cast(date_sub('2020-06-17', 1) as string) as dt
from tmp
where old_id is not null
  and new_id is not null;
select * from dim_user_info_zip;

在这里插入图片描述

select id,
       name,
       start_date,
       '9999-12-31' as end_date,
       '9999-12-31' as dt
from dim_user_info_zip
where end_date >= '2020-06-16'
  and start_date <= '2020-06-16';

-- 17号也是当天最新数据
select *
from dim_user_info_zip
where end_date='9999-12-31';

在这里插入图片描述

拉链表回滚

假设2020-06-16号的数据出错了,要进行回滚,要回滚到2020-06-15,然后重新跑16号、17号的数据

1、重修跑16、17号数据,前提是,16号、17号的数据已经修复正确
2、注意备份数据

-- 拉链表回滚

-- 1、回滚到2020-06-15
insert overwrite table dim_user_info_zip partition (dt)
select id,
       name,
       start_date,
       '9999-12-31' as end_date, -- 将15号,闭链的数据,重新开链,即《威廉》
       '9999-12-31' as dt -- 最新分区
from dim_user_info_zip
where end_date >= '2020-06-15'
  and start_date <= '2020-06-15';

-- 2、重新跑16号、17号的数据
-- 2.1 跑一下 每日装载(16号)
-- 2.2 跑一下 每日装载(17号)
Hive是一个基于Hadoop的数据仓库架构​​,可以用于处理大数据量的结构化和半结构化数据拉链表是一种常用于历史数据关联分析的技术。下面我将简要介绍如何在Hive中使用拉链表来关联历史数据拉链表是一种用于处理数据随时间变动的关联关系的表设计方法。它通常包含两个重要的字段:生效日期和失效日期。生效日期表示了某个数据的开始日期,失效日期表示了该数据的结束日期。 在Hive中,我们可以使用窗口函数和联接操作来实现拉链表的历史数据关联。首先,我们需要在数据表中添加生效日期和失效日期字段,并将历史数据按照日期排序。然后,我们可以使用窗口函数来为每一行数据分配一个唯一的行号,并根据行号对历史数据进行版本化。接下来,我们可以使用联接操作将历史数据和当前数据关联起来,以便进行历史数据查询和分析。 例如,我们有一个学生表student,其中包含学生的信息,还有一个历史变动表student_history,用于记录学生信息的变动。我们可以使用拉链表的方式将学生表和学生历史变动表进行关联。 首先,我们可以在student表和student_history表中添加生效日期和失效日期字段,并将数据按照日期排序。然后,我们使用窗口函数为每一行数据分配一个唯一的行号,并对学生历史变动表进行版本化。 接下来,我们可以使用联接操作将学生表和学生历史变动表进行关联,以便查询某个时间点学生的信息以及变动情况。例如,我们可以查询某个日期范围内学生的姓名、年龄以及变动情况。 在Hive中,使用拉链表来关联历史数据可以提供更加全面和准确的历史数据分析。通过使用窗口函数和联接操作,我们可以轻松地实现拉链表的历史数据关联。
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值