文章目录
拉链表的实现需要借助一个分区表来实现,分区表的每个分区对应的是每天新增和变化的数据,注意每个分区只是新增和变化的数据,而拉链表本身可以是普通表也可以是分区表,拉链表中存的就是截止到当前日期最全的一版数据,其实就是拉链表=修改后的历史数据+新增数据
1. 初始化数据
把 2020-06-20 这一天的数据同步到分区表和拉链表中,以此作为历史数据,
以后逐天递增
1.1 建表
- 用户信息表
DROP TABLE IF EXISTS test.userinfo;
CREATE TABLE test.userinfo(
userid STRING COMMENT '用户编号',
mobile STRING COMMENT '手机号码',
regdate STRING COMMENT '注册日期')
COMMENT '用户信息'
PARTITIONED BY (dt string)
row format delimited fields terminated by ',';
- 拉链表
DROP TABLE IF EXISTS test.userhis;
CREATE TABLE test.userhis(
userid STRING COMMENT '用户编号',
mobile STRING COMMENT '手机号码',
regdate STRING COMMENT '注册日期',
start_date STRING,
end_date STRING)
COMMENT '用户信息拉链表'
row format delimited fields terminated by ',';
注意:拉链表多了两个字段start_date、end_date
1.2 加载数据
/opt/datas/hive/userinfo0620.dat
001,13551111111,2020-03-01
002,13561111111,2020-04-01
003,13571111111,2020-05-01
004,13581111111,2020-06-01
- 加载到分区表
load data local inpath '/opt/datas/hive/userinfo0620.dat' into table test.userinfo partition(dt ='2020-06-20');
- 加载到拉链表
拉链表由于多了2个字段 start-date,end_date,所以不能直接加载,我们需要从分区表查询数据,然后导入到拉链表中
insert overwrite table test.userhis
select
userid,
mobile,
regdate,
dt as start_date,
'9999-12-31' as end_date
from
test.userinfo
where
dt = '2020-06-20';
1.3 验证同步数据
- 分区表
+------------------+------------------+-------------------+--------------+
| userinfo.userid | userinfo.mobile | userinfo.regdate | userinfo.dt |
+------------------+------------------+-------------------+--------------+
| 001 | 13551111111 | 2020-03-01 | 2020-06-20 |
| 002 | 13561111111 | 2020-04-01 | 2020-06-20 |
| 003 | 13571111111 | 2020-05-01 | 2020-06-20 |
| 004 | 13581111111 | 2020-06-01 | 2020-06-20 |
+------------------+------------------+-------------------+--------------+
- 拉链表
+-----------------+-----------------+------------------+---------------------+-------------------+
| userhis.userid | userhis.mobile | userhis.regdate | userhis.start_date | userhis.end_date |
+-----------------+-----------------+------------------+---------------------+-------------------+
| 001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 |
| 002 | 13561111111 | 2020-04-01 | 2020-06-20 | 9999-12-31 |
| 003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
| 004 | 13581111111 | 2020-06-01 | 2020-06-20 | 9999-12-31 |
+-----------------+-----------------+------------------+---------------------+-------------------+
2. 新增2020-06-21分区数据
/opt/datas/hive/userinfo0621.dat
002,13562222222,2020-04-01,2020-06-21
004,13582222222,2020-06-01,2020-06-21
005,13552222222,2020-06-21,2020-06-21
说明:从这里可以看出,2020-06-21这一天新增的2条数据004和005,修改了002这条数据,这也和我们一开始说的每个分区只是对应新增和修改的数据一致。
load data local inpath '/opt/datas/hive/userinfo0621.dat' into table test.userinfo partition(dt = '2020-06-21')
select * from test.userinfo where dt = '2020-06-21';
+------------------+------------------+-------------------+--------------+
| userinfo.userid | userinfo.mobile | userinfo.regdate | userinfo.dt |
+------------------+------------------+-------------------+--------------+
| 002 | 13562222222 | 2020-04-01 | 2020-06-21 |
| 004 | 13582222222 | 2020-06-01 | 2020-06-21 |
| 005 | 13552222222 | 2020-06-21 | 2020-06-21 |
+------------------+------------------+-------------------+--------------+
截止到目前为止,我们把数据准备好了,分区表有2个分区的数据,分别对应2020-06-20和2020-06-21,拉链表中有一个分区的数据,就是2020-06-20分区的数据,由于2020-06-21这天,新增了2条数据005和006以及修改了001这条数据,那么下面我们就分析如何将2020-06-21这天的数据和历史数据做个融合,同步到拉链表中
3. 加载数据到拉链表
我们知道拉链表=新增数据+修改后的历史数据,下面我们就按此进行分析
- 获取新增数据
select
userid,
mobile,
regdate,
dt as start_date,
'9999-12-31' as end_date
from test.userinfo where dt = '2020-06-21';
- 获取历史数据
这里要做一个特别说明:历史数据是当天新增数据和截止前一天的拉链表做的一个join的结果,这个join就是对之前历史数据做一个数据过期的标记
select
t1.userid userid,
t1.mobile mobile,
t1.regdate regdate,
t1.start_date start_date,
case when
t2.userid is not null and t1.end_date='9999-12-31'
then
date_sub('2020-06-21',1)
else
t1.end_date end as end_date
from test.userhis t1
left join
(select * from test.userinfo where dt = '2020-06-21') t2
on t1.userid = t2.userid
结果
+------------+--------------+-------------+-------------+-------------+
| userid | mobile | regdate | start_date | end_date |
+------------+--------------+-------------+-------------+-------------+
| 001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 |
| 002 | 13561111111 | 2020-04-01 | 2020-06-20 | 2020-06-20 |
| 003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
| 004 | 13581111111 | 2020-06-01 | 2020-06-20 | 2020-06-20 |
+------------+--------------+-------------+-------------+-------------+
- 合并新增数据和历史数据插入拉链表
insert overwrite table test.userhis
select
t1.userid t1_userid,
t1.mobile mobile,
t1.regdate regdate,
t1.start_date start_date,
case when
t2.userid is not null and t1.end_date='9999-12-31'
then
date_sub('2020-06-21',1)
else
t1.end_date end as end_date
from test.userhis t1
left join
(select * from test.userinfo where dt = '2020-06-21') t2
on t1.userid = t2.userid
union all
select
userid,
mobile,
regdate,
dt as start_date,
'9999-12-31' as end_date
from test.userinfo where dt = '2020-06-21';
结果
+-----------------+-----------------+------------------+---------------------+-------------------+
| userhis.userid | userhis.mobile | userhis.regdate | userhis.start_date | userhis.end_date |
+-----------------+-----------------+------------------+---------------------+-------------------+
| 001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 |
| 002 | 13561111111 | 2020-04-01 | 2020-06-20 | 2020-06-20 |
| 003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
| 004 | 13581111111 | 2020-06-01 | 2020-06-20 | 2020-06-20 |
| 002 | 13562222222 | 2020-04-01 | 2020-06-21 | 9999-12-31 |
| 004 | 13582222222 | 2020-06-01 | 2020-06-21 | 9999-12-31 |
| 005 | 13552222222 | 2020-06-21 | 2020-06-21 | 9999-12-31 |
+-----------------+-----------------+------------------+---------------------+-------------------+
可以看到所谓拉链表就是修改后的历史数据+新增数据。
我们再增加一天新增数据,然后再同步一次拉链表,看看有没有什么规律
4. 新增2020-06-22分区数据
/opt/datas/hive/userinfo0622.dat
004,13333333333,2020-06-01,2020-06-22
005,13533333333,2020-06-21,2020-06-22
006,13733333333,2020-06-22,2020-06-22
说明:从这里可以看出,2020-06-22这一天新增的1条数据006,
修改了2条数据004和005。
load data local inpath '/opt/datas/hive/userinfo0622.dat' into table test.userinfo partition(dt = '2020-06-22')
+------------------+------------------+-------------------+--------------+
| userinfo.userid | userinfo.mobile | userinfo.regdate | userinfo.dt |
+------------------+------------------+-------------------+--------------+
| 004 | 13333333333 | 2020-06-01 | 2020-06-22 |
| 005 | 13533333333 | 2020-06-21 | 2020-06-22 |
| 006 | 13733333333 | 2020-06-22 | 2020-06-22 |
+------------------+------------------+-------------------+--------------+
5. 再次加载数据到拉链表
- 新增数据
select
userid,
mobile,
regdate,
dt as start_date,
'9999-12-31' as end_date
from test.userinfo where dt = '2020-06-22';
- 历史数据
select
t1.userid userid,
t1.mobile mobile,
t1.regdate regdate,
t1.start_date start_date,
case when
t2.userid is not null and t1.end_date='9999-12-31'
then
date_sub('2020-06-22',1)
else
t1.end_date end as end_date
from test.userhis t1
left join
(select * from test.userinfo where dt = '2020-06-22') t2
on t1.userid = t2.userid;
结果:
+------------+--------------+-------------+-------------+-------------+
| userid | mobile | regdate | start_date | end_date |
+------------+--------------+-------------+-------------+-------------+
| 001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 |
| 002 | 13561111111 | 2020-04-01 | 2020-06-20 | 2020-06-20 |
| 003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
| 004 | 13581111111 | 2020-06-01 | 2020-06-20 | 2020-06-20 |
| 002 | 13562222222 | 2020-04-01 | 2020-06-21 | 9999-12-31 |
| 004 | 13582222222 | 2020-06-01 | 2020-06-21 | 2020-06-21 |
| 005 | 13552222222 | 2020-06-21 | 2020-06-21 | 2020-06-21 |
+------------+--------------+-------------+-------------+-------------+
- 合并历史数据和新增数据插入拉链表
insert overwrite table test.userhis
select
t1.userid t1_userid,
t1.mobile mobile,
t1.regdate regdate,
t1.start_date start_date,
case when
t2.userid is not null and t1.end_date='9999-12-31'
then
date_sub('2020-06-22',1)
else
t1.end_date end as end_date
from test.userhis t1
left join
(select * from test.userinfo where dt = '2020-06-22') t2
on t1.userid = t2.userid
union all
select
userid,
mobile,
regdate,
dt as start_date,
'9999-12-31' as end_date
from test.userinfo where dt = '2020-06-22';
结果:
select * from test.userhis;
+-----------------+-----------------+------------------+---------------------+-------------------+
| userhis.userid | userhis.mobile | userhis.regdate | userhis.start_date | userhis.end_date |
+-----------------+-----------------+------------------+---------------------+-------------------+
| 001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 |
| 002 | 13561111111 | 2020-04-01 | 2020-06-20 | 2020-06-20 |
| 003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
| 004 | 13581111111 | 2020-06-01 | 2020-06-20 | 2020-06-20 |
| 002 | 13562222222 | 2020-04-01 | 2020-06-21 | 9999-12-31 |
| 004 | 13582222222 | 2020-06-01 | 2020-06-21 | 2020-06-21 |
| 005 | 13552222222 | 2020-06-21 | 2020-06-21 | 2020-06-21 |
| 004 | 13333333333 | 2020-06-01 | 2020-06-22 | 9999-12-31 |
| 005 | 13533333333 | 2020-06-21 | 2020-06-22 | 9999-12-31 |
| 006 | 13733333333 | 2020-06-22 | 2020-06-22 | 9999-12-31 |
+-----------------+-----------------+------------------+---------------------+-------------------+
可以看到这个代码是通用的,下面2020-06-23…24…25 都是按照这样写入拉链表,那么我们可以写一个写入拉链表的脚本
#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table test.userhis
select
t1.userid t1_userid,
t1.mobile mobile,
t1.regdate regdate,
t1.start_date start_date,
case when
t2.userid is not null and t1.end_date='9999-12-31'
then
date_sub('$do_date',1)
else
t1.end_date end as end_date
from test.userhis t1
left join
(select * from test.userinfo where dt = '$do_date') t2
on t1.userid = t2.userid
union all
select
userid,
mobile,
regdate,
dt as start_date,
'9999-12-31' as end_date
from test.userinfo where dt = '$do_date';"
hive -e "$sql"
6. 拉链表的使用
- 查看拉链表中最新数据(2020-06-22以后的数据)
select * from userhis where end_date='9999-12-31' order by userid;
结果:
+-----------------+-----------------+------------------+---------------------+-------------------+
| userhis.userid | userhis.mobile | userhis.regdate | userhis.start_date | userhis.end_date |
+-----------------+-----------------+------------------+---------------------+-------------------+
| 001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 |
| 002 | 13562222222 | 2020-04-01 | 2020-06-21 | 9999-12-31 |
| 003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
| 004 | 13333333333 | 2020-06-01 | 2020-06-22 | 9999-12-31 |
| 005 | 13533333333 | 2020-06-21 | 2020-06-22 | 9999-12-31 |
| 006 | 13733333333 | 2020-06-22 | 2020-06-22 | 9999-12-31 |
+-----------------+-----------------+------------------+---------------------+-------------------+
- 查看拉链表中给定日期数据(“2020-06-21”)
select * from userhis where start_date <= '2020-06-21' and end_date >= '2020-06-21' order by userid;
之所以要 end_date >= ‘2020-06-21’,是要这条数据在2020-06-21还没有过期
结果:
+-----------------+-----------------+------------------+---------------------+-------------------+
| userhis.userid | userhis.mobile | userhis.regdate | userhis.start_date | userhis.end_date |
+-----------------+-----------------+------------------+---------------------+-------------------+
| 001 | 13551111111 | 2020-03-01 | 2020-06-20 | 9999-12-31 |
| 002 | 13562222222 | 2020-04-01 | 2020-06-21 | 9999-12-31 |
| 003 | 13571111111 | 2020-05-01 | 2020-06-20 | 9999-12-31 |
| 004 | 13582222222 | 2020-06-01 | 2020-06-21 | 2020-06-21 |
| 005 | 13552222222 | 2020-06-21 | 2020-06-21 | 2020-06-21 |
+-----------------+-----------------+------------------+---------------------+-------------------+