缓慢变化维之拉链表

       数据仓库设计维度表时涉及缓慢变化维,针对缓慢变化维,根据业务场景,有两种设计:其一覆盖更新,其二新增插入。针对第一种情况可以实行全面覆盖更新,针对第二种情况可以采取每日的全量备份和拉链表。每日全量备份有很多相同的数据信息,对存储空间是极大的浪费,全面覆盖更新就难以查询历史状态,而拉链表就是为了保留历史数据,又节省空间而设计的一种模式。

      另外拉链表的一个优化设计是增加标记最新记录的字段,对于不需要考虑历史数据的场景亦适用。至于为什么不用expire_date来判断最新状态,我举个简单的例子做说明:比如将员工表设计成一个拉链表:那么当一个员工离职时,就会将离职员工最新记录的expire_date更新为离职日期,为了筛选出每个员工的最新记录,我就需要分情况处理,一是expire_date='9999-12-31'的情况,二是expire_date=leave_date的情况,没有增加标记字段来得方便。

  • 拉链表的实现流程

cd0a445dc38a09f6e8ba2cafb5c8c475d00.jpg

  • 拉链表的设计思路

思路一:

1、将ODS的数据全量存储在临时表1中;

2、将维度表的最新数据(通过当前记录标记判断)和临时表1通过业务主键做左关联,在维度表中找出缓慢变化维有变化和不存在于临时表1的业务数据的代理主键,将这部分数据(包含历史维度表的主键)存储在临时表2中;

3、找出临时表1中需要更新(业务主键存在于维度表且标记为维度表当前记录)和新增的数据(业务主键不存在于维度表)插入到维度表中;

4、根据临时表2的代理主键,更新维度表的expire_date为pre_date;

思路二:

1、将ODS的数据全量存储在临时表1中;

2、将维度表的最新数据(通过当前记录标记判断)和临时表1通过业务主键做左关联,在维度表中找出缓慢变化维有变化和不存在于临时表1的业务数据,将这部分数据的expire_date更新为pre_date,并变更为历史记录;

3、找出临时表1中需要更新(每个业务主键的最后一条记录)和新增的数据(业务主键不存在于维度表)插入到维度表中;

思路三:

1、将ODS的数据全量存储在临时表1中;

2、找出临时表1中需要更新和新增的数据插入到维度表中;

3、根据维度表中同一个业务主键,相同expire_date有两条的数据,更新valid_date早的记录的expire_date;

4、在维度表中找出不存在于临时表1的业务主键,更新这些业务主键对应最新记录为历史记录;

  • 拉链表实现方式

第一步都是实现维度表的全量加载:

1、kettle

在kettle中实现了思路三的方式

1.1 第一个转换,更新和新增的数据插入到历史维度表

f3816e364ecc4b040822c6079bdc7257a07.jpg

cebd948bbbda24b713f8649e87bdeb9fbc9.jpg

1.2 第二个转换,根据历史维度表中同一个自然键,相同end_date有两条的数据,更新start_date早的记录的end_date;

ac0619a678eeaeedf3c67f6238299df904d.jpg

2、mysql存储过程

实现思路一的方式

/*
user_id是业务主键,sub_company_id,department_id,team_leader_id是三个需要新增的缓慢变化维
*/
CREATE PROCEDURE `base_user_dim`()
BEGIN
 declare max_date VARCHAR(50) DEFAULT '9999-12-31';
 declare pre_date  VARCHAR(50) DEFAULT DATE_FORMAT(CURDATE()-1,'%Y-%m-%d');

delete from base_user_dim_tmp1;
/*
将有更新的记录在tmp1中,后续更新其expire_date
*/
insert into base_user_dim_tmp1
select a.id,a.user_id from base_user_dim a
left join base_user b on a.user_id=b.user_id 
where a.is_new=1 and (b.user_id is null or a.sub_company_id<>b.sub_company_id or a.department_id<>b.department_id or a.team_leader_id<>b.team_leader_id);
/*  
插入新增(b.user_id is null)和更新的数据(a.sub_company_id<>b.sub_company_id or a.department_id<>b.department_id or a.team_leader_id<>b.team_leader_id)
*/
insert into base_user_dim(user_id,name,sub_company_id,department_id,sub_company_name,
department_name,team_leader_id,team_leader_name,valid_date,expire_date,is_new)
select a.user_id,a.name,a.sub_company_id,a.department_id,a.sub_company_name,
a.department_name,a.team_leader_id,a.team_leader_name,curdate() as valid_date
, case when leave_date is null then max_date else leave_date end as expire_date,a.is_new 
from base_user a left join base_user_dim b
on a.user_id=b.user_id and b.is_new=1
where a.business_role in(1,2) and a.sub_company_id not in(42,730,811) and a.user_id<>1 and a.`status`=1 and 
(b.user_id is null or(a.sub_company_id<>b.sub_company_id or a.department_id<>b.department_id or a.team_leader_id<>b.team_leader_id));

/*
更新tmp1中数据的expire_date
*/
update base_user_dim set expire_date=pre_date,is_new=0 where id in(select id from  base_user_dim_tmp1);

END

 

3、hive

实现思路二的方式

3.1 使用hive事务的方式

<!-- 设置hive支持事务
 如果一个表要实现update和delete功能,该表就必须支持ACID,而支持ACID,就必须满足以下条件:
 1、表的存储格式必须是ORC(STORED AS ORC);
 2、表必须进行分桶(CLUSTERED BY (col_name, col_name, …) INTO num_buckets BUCKETS);
 3、Table property中参数transactional必须设定为True(tblproperties(‘transactional’=‘true’));
-->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.in.test</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
-- ---------------------初始化开始----------------------------------------------------------------------------------------------------
-- 如果一个表要实现update和delete功能,该表就必须支持ACID,而支持ACID,就必须满足以下条件:
-- 1、表的存储格式必须是ORC(STORED AS ORC);
-- 2、表必须进行分桶(CLUSTERED BY (col_name, col_name, …) INTO num_buckets BUCKETS);
-- 3、Table property中参数transactional必须设定为True(tblproperties(‘transactional’=‘true’));
-- 先导入到hdfs
create table base_user_dim2(
  `id` int COMMENT '客户ID',
  `user_id` int COMMENT '用户id',
  `name` string COMMENT '用户姓名',
  `sub_company_id` int COMMENT '分公司id',
  `department_id` int COMMENT '部门id',
  `sub_company_name` string COMMENT '所属分公司',
  `department_name` string COMMENT '所属部门',
  `team_leader_id` int COMMENT '团队长ID',
  `team_leader_name` string COMMENT '团队长名字',
  `valid_date` date  COMMENT '有效期',
  `expire_date` date COMMENT '到期日',
  `is_new` int COMMENT '是否最新'
)clustered by (id) into 2 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '$' LINES TERMINATED BY '\n' STORED AS orc TBLPROPERTIES('transactional'='true'); 

sqoop import \
--connect jdbc:mysql://118.31.79.185:3306/crm_cronus \
--driver com.mysql.jdbc.Driver \
--username fang_read \
--password read2wsx \
--split-by user_id \
--query 'select user_id,name,sub_company_id,department_id,sub_company_name,department_name,team_leader_id,team_leader_name,entry_date as valid_date,'9999-12-31' as expire_date,1 as is_new from fangdb.base_user where $CONDITIONS' \
--target-dir base_user_tmp1 \
--fields-terminated-by '$' -m 3 \
--lines-terminated-by '\n'

-- load数据到base_user_tmp1
 LOAD DATA INPATH 'hdfs://sjzx1:9000/user/hadoop/base_user_tmp1' overwrite INTO TABLE base_user_tmp1;
-- 产生代理建,自增主键
insert into base_user_dim2
select rank() over(order by user_id,valid_date) as rk,a.*
from base_user_tmp1 a;
-----------------------初始化结束----------------------------------------------------------------------------------------------------
-- 修改
sqoop import \
--connect jdbc:mysql://118.31.79.185:3306/crm_cronus \
--driver com.mysql.jdbc.Driver \
--username fang_read \
--password read2wsx \
--split-by user_id \
--query 'select user_id,name,sub_company_id,department_id,sub_company_name,department_name,team_leader_id,team_leader_name,curdate() as valid_date,'9999-12-31' as expire_date,1 as is_new from fangdb.base_user where business_role in(1,2) and sub_company_id not in(42,730,811) and user_id<>1 and status=1 and $CONDITIONS' \
--target-dir base_user_tmp1 \
--fields-terminated-by '$' -m 3 \
--lines-terminated-by '\n'

-- load数据到base_user_tmp1
 LOAD DATA INPATH 'hdfs://sjzx1:9000/user/hadoop/base_user_tmp1' overwrite INTO TABLE base_user_tmp1;
 
 -- 需要更新的数据
CREATE TEMPORARY TABLE mid as
select a.id 
from base_user_dim2 a
left join base_user_tmp1 b on a.user_id=b.user_id
where a.is_new=1 and (b.user_id is null or a.sub_company_id<>b.sub_company_id or a.department_id<>b.department_id or a.team_leader_id<>b.team_leader_id);

-- 新增的数据
insert into base_user_dim2
select rank() over(order by a.user_id)+rk_max as id,a.* from(
select a.user_id,a.name,a.sub_company_id,a.department_id,a.sub_company_name,a.department_name,a.team_leader_id,a.team_leader_name, a.valid_date,a.expire_date,a.is_new 
from base_user_tmp1 a
left join base_user_dim b on a.user_id=b.user_id and b.is_new=1
where (b.user_id is null or a.sub_company_id<>b.sub_company_id or a.department_id<>b.department_id or a.team_leader_id<>b.team_leader_id )
)a cross join (select coalesce(max(id),0) as rk_max from base_user_dim) b;


-- 更新需要更新的数据,expire_date更新为前一天,is_new更新为0
update base_user_dim2  set expire_date=date_add(current_date,-1),is_new=0
where exists (select 1 from mid b where base_user_dim2.id=b.id);



-- --------  待定 第二步没有排除全部需要插入的数据,待优化------------------------------------------------
-- 更新需要更新的数据,expire_date更新为前一天,is_new更新为0,其中b.user_id is null表示已经删除的用户,
-- a.sub_company_id<>b.sub_company_id or a.department_id<>b.department_id or 
-- a.team_leader_id<>b.team_leader_id表示变更的数据

update base_user_dim2  set expire_date=date_add(current_date,-1),is_new=0
where exists (select 1 from (
select a.id 
from base_user_dim2 a
left join base_user_tmp1 b on a.user_id=b.user_id
where a.is_new=1 and (b.user_id is null or a.sub_company_id<>b.sub_company_id or a.department_id<>b.department_id or a.team_leader_id<>b.team_leader_id)
) b where base_user_dim2.id=b.id);


-- 新增的数据,其中b.user_id is null表示已经新增的用户,a.sub_company_id<>b.sub_company_id or 
-- a.department_id<>b.department_id or a.team_leader_id<>b.team_leader_id表示变更的数据

insert into base_user_dim2
select rank() over(order by a.user_id)+rk_max as id,a.* from(
select a.user_id,a.name,a.sub_company_id,a.department_id,a.sub_company_name,a.department_name,a.team_leader_id,a.team_leader_name, a.valid_date,a.expire_date,a.is_new 
from base_user_tmp1 a
left join base_user_dim2 b on a.user_id=b.user_id and b.is_new=0 and b.expire_date=date_add(current_date,-1)
where (b.user_id is null or a.sub_company_id<>b.sub_company_id or a.department_id<>b.department_id or a.team_leader_id<>b.team_leader_id )
)a cross join (select coalesce(max(id),0) as rk_max from base_user_dim) b;
 -- --------  待定 ------------------------------------------------------------------------------
 

3.2 不使用hive事务的方式

-- ---------------------初始化开始----------------------------------------------------------------------------------------------------
create table base_user_dim(
  `id` int COMMENT '客户ID',
  `user_id` int COMMENT '用户id',
  `name` string COMMENT '用户姓名',
  `sub_company_id` int COMMENT '分公司id',
  `department_id` int COMMENT '部门id',
  `sub_company_name` string COMMENT '所属分公司',
  `department_name` string COMMENT '所属部门',
  `team_leader_id` int COMMENT '团队长ID',
  `team_leader_name` string COMMENT '团队长名字',
  `valid_date` date  COMMENT '有效期',
  `expire_date` date COMMENT '到期日',
  `is_new` int COMMENT '是否最新'
)ROW FORMAT DELIMITED FIELDS TERMINATED BY '$' LINES TERMINATED BY '\n' STORED AS textfile;

create table base_user_tmp1(
  `user_id` int COMMENT '用户id',
  `name` string COMMENT '用户姓名',
  `sub_company_id` int COMMENT '分公司id',
  `department_id` int COMMENT '部门id',
  `sub_company_name` string COMMENT '所属分公司',
  `department_name` string COMMENT '所属部门',
  `team_leader_id` int COMMENT '团队长ID',
  `team_leader_name` string COMMENT '团队长名字',
  `valid_date` date  COMMENT '有效期',
  `expire_date` date COMMENT '到期日',
  `is_new` int COMMENT '是否最新'
)ROW FORMAT DELIMITED FIELDS TERMINATED BY '$' LINES TERMINATED BY '\n' STORED AS textfile;

-- 先导入到hdfs
sqoop import \
--connect jdbc:mysql://118.31.79.185:3306/crm_cronus \
--driver com.mysql.jdbc.Driver \
--username fang_read \
--password read2wsx \
--split-by user_id \
--query 'select user_id,name,sub_company_id,department_id,sub_company_name,department_name,team_leader_id,team_leader_name,entry_date as valid_date,'9999-12-31' as expire_date,1 as is_new from fangdb.base_user where $CONDITIONS' \
--target-dir base_user_tmp1 \
--fields-terminated-by '$' -m 3 \
--lines-terminated-by '\n'

-- load数据到base_user_tmp1
 LOAD DATA INPATH 'hdfs://sjzx1:9000/user/hadoop/base_user_tmp1' overwrite INTO TABLE base_user_tmp1;
-- 产生代理建,自增主键
insert into base_user_dim
select rank() over(order by user_id,valid_date) as rk,
a.*
from base_user_tmp1 a;
-- ---------------------初始化结束----------------------------------------------------------------------------------------------------
-- ---------------------定时任务开始----------------------------------------------------------------------------------------------------
hadoop fs  -rm -r /user/hadoop/base_user_tmp1
-- 修改
sqoop import \
--connect jdbc:mysql://118.31.79.185:3306/crm_cronus \
--driver com.mysql.jdbc.Driver \
--username fang_read \
--password read2wsx \
--split-by user_id \
--query 'select user_id,name,sub_company_id,department_id,sub_company_name,department_name,team_leader_id,team_leader_name,curdate() as valid_date,'9999-12-31' as expire_date,1 as is_new from fangdb.base_user where business_role in(1,2) and sub_company_id not in(42,730,811) and user_id<>1 and status=1 and $CONDITIONS' \
--target-dir base_user_tmp1 \
--fields-terminated-by '$' -m 3 \
--lines-terminated-by '\n'
-- load数据到base_user_tmp1

 LOAD DATA INPATH 'hdfs://sjzx1:9000/user/hadoop/base_user_tmp1' overwrite INTO TABLE base_user_tmp1;
 
-- 更新需要更新的数据mid1
CREATE TEMPORARY TABLE mid1 as
select a.id,a.user_id,a.name,a.sub_company_id,a.department_id,a.sub_company_name,a.department_name,a.team_leader_id,a.team_leader_name,a.valid_date,date_add(current_date,-1) as expire_date,0 as is_new 
from base_user_dim a
left join base_user_tmp1 b on a.user_id=b.user_id
where a.is_new=1 and (b.user_id is null or a.sub_company_id<>b.sub_company_id or a.department_id<>b.department_id or a.team_leader_id<>b.team_leader_id);

-- 不变的数据 hive没有in
CREATE TEMPORARY TABLE mid2 as
select * from base_user_dim a
where not exists(select 1 from mid1 b where a.id=b.id);

-- 新增的数据
CREATE TEMPORARY TABLE mid3 as
select rank() over(order by a.user_id)+rk_max as id,a.* from(
select a.user_id,a.name,a.sub_company_id,a.department_id,a.sub_company_name,a.department_name,a.team_leader_id,a.team_leader_name, a.valid_date,a.expire_date,a.is_new 
from base_user_tmp1 a
left join base_user_dim b on a.user_id=b.user_id and b.is_new=1
where (b.user_id is null or a.sub_company_id<>b.sub_company_id or a.department_id<>b.department_id or a.team_leader_id<>b.team_leader_id )
)a cross join (select coalesce(max(id),0) as rk_max from base_user_dim) b;

-- 全量覆盖目标表
INSERT overwrite table base_user_dim  select * from  mid1;
INSERT  into base_user_dim  select * from  mid2;
INSERT  into base_user_dim  select * from  mid3;
-- ---------------------定时任务结束-------------------------------------------------------------------------------------------------

 

转载于:https://my.oschina.net/u/4010291/blog/3041939

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值