A fact table (Fact) holds the measurable events of a data warehouse. Fact tables usually grow very large, so in practice they are built as partitioned tables, most often partitioned by date. Below we build an order fact table (kept unpartitioned here to keep the example small).

================================================= Simulating the order data ==========================

Day 1 data.

# Order file (order_update.txt), columns: order_id,user_id,price.
# The order file feeds the fact table; the user file below feeds the dimension table.
1,1,100
2,2,200
3,3,300

# Create the HDFS directory and upload the order data
sudo -u hive hdfs dfs -mkdir -p /user/hive/tmp/order_update
sudo -u hive hdfs dfs -put order_update.txt /user/hive/tmp/order_update/

# User file (user_update.txt), columns: user_id,name,cty,st
1,pxj,wx,0
2,pl,gs,0
3,py,dz,0
4,hw,wx,0
5,yb,yc,0
6,lc,wx,0

# Create the HDFS directory and upload the user data
sudo -u hive hdfs dfs -mkdir -p /user/hive/tmp/user_update
sudo -u hive hdfs dfs -put user_update.txt /user/hive/tmp/user_update/

# Create the order fact table
create table fact_order (
    surr_user_id int,    -- surrogate user id
    user_id      int,    -- original (business) user id
    order_id     int,    -- order id
    price        int     -- price
)
comment 'order fact'
row format delimited fields terminated by ','
lines terminated by '\n';

# Sqoop imports each day's data from the business system into the table
# below. The daily order table is an external table; this ODS table is a
# staging table that holds only the orders added that day.
create external table ods_order_daily (
    order_id int,
    user_id  int,
    price    int
)
comment 'ods daily order'
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties (
    'field.delim'=',',
    'line.delim'='\n',
    'serialization.format'=',',
    'serialization.encoding'='gbk')
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs:///user/hive/tmp/order_update';

# Check the contents
hive> select * from ods_order_daily;
OK
1	1	100
2	2	200
3	3	300

# Sqoop likewise imports each day's user data into the external table
# below; this ODS staging table holds the users added or changed that day.
create external table if not exists ods_user_update (
    user_id int,
    name    string,
    cty     string,
    st      string)
comment 'daily user update'
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties (
    'field.delim'=',',
    'line.delim'='\n',
    'serialization.format'=',',
    'serialization.encoding'='UTF-8')
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs:///user/hive/tmp/user_update/';

# Check the contents
hive>
select * from ods_user_update;
OK
1	pxj	wx	0
2	pl	gs	0
3	py	dz	0
4	hw	wx	0
5	yb	yc	0
6	lc	wx	0

========================================== The setup above is the test environment; now run the actual load ========================

In the hive shell (note that hivevar substitution is purely textual: the
expressions below are pasted verbatim into the queries that reference them):

SET hivevar:pre_date = DATE_ADD(CURRENT_DATE(),-1);
SET hivevar:max_date = CAST('2200-01-01' AS DATE);

# Update the user dimension table: expire the current version of every
# changed user, then insert the changed and newly added users as new
# current versions.
INSERT OVERWRITE TABLE dim_user
SELECT *
FROM (
    -- Existing rows: if a user appears in today's update, close its open
    -- version by setting the end date to yesterday; otherwise keep it as is.
    SELECT A.surr_user_id, A.user_id, A.name, A.cty, A.st, A.version,
           A.ver_start_date,
           CASE WHEN B.user_id IS NOT NULL AND A.ver_end_date = ${hivevar:max_date}
                THEN ${hivevar:pre_date}
                ELSE A.ver_end_date
           END AS ver_end_date
    FROM dim_user AS A
    LEFT JOIN ods_user_update AS B ON A.user_id = B.user_id
    UNION
    -- New rows: one open version per added/changed user, with surrogate
    -- keys continuing from the current maximum.
    SELECT ROW_NUMBER() OVER (ORDER BY C.user_id) + D.sk_max,
           C.user_id, C.name, C.cty, C.st, 0,
           ${hivevar:pre_date} AS ver_start_date,
           ${hivevar:max_date} AS ver_end_date
    FROM ods_user_update AS C
    CROSS JOIN (SELECT COALESCE(MAX(surr_user_id), 0) sk_max FROM dim_user) D
) AS T;

hive> select * from dim_user;
OK
1	1	pxj	wx	0	0	2019-11-05	2200-01-01
2	2	pl	gs	0	0	2019-11-05	2200-01-01
3	3	py	dz	0	0	2019-11-05	2200-01-01
4	4	hw	wx	0	0	2019-11-05	2200-01-01
5	5	yb	yc	0	0	2019-11-05	2200-01-01
6	6	lc	wx	0	0	2019-11-05	2200-01-01

# Load the fact table with the day's orders, resolving each order's user
# to the current dimension row (ver_end_date = '2200-01-01').
insert into fact_order(surr_user_id,user_id,order_id,price)
select b.surr_user_id, b.user_id, a.order_id, a.price
from ods_order_daily a
join dim_user b
  on a.user_id = b.user_id and b.ver_end_date = '2200-01-01';

Result:
hive> select * from fact_order;
OK
1	1	1	100
2	2	2	200
3	3	3	300

================================================= Simulating day 2 data ==========================================

# Order file contents:
4,1,320
5,6,232
6,5,223
7,3,500

# User file contents:
1,pxj,wx,2
2,pl,gs,2
10,aa,bb,0
12,cc,cc,0
13,dd,dd,0

=================================== Remove day 1's files and upload day 2's ==================

sudo -u hive hdfs dfs -rm -r /user/hive/tmp/user_update/*
sudo -u hive hdfs dfs -rm -r /user/hive/tmp/order_update/*
sudo -u hive hdfs dfs -put 20_24_1_order_update.txt /user/hive/tmp/order_update/
sudo -u hive hdfs dfs -put 20_24_1_user_update.txt /user/hive/tmp/user_update/

[root@CDH3 data]# sudo -u hive hdfs dfs -cat /user/hive/tmp/order_update/*
4,1,320
5,6,232
6,5,223
7,3,500
[root@CDH3 data]# sudo -u hive hdfs dfs -cat /user/hive/tmp/user_update/*
1,pxj,wx,2
2,pl,gs,2
10,aa,bb,0
12,cc,cc,0
13,dd,dd,0

# Update the user dimension table (the same statement as on day 1)
INSERT OVERWRITE TABLE dim_user
SELECT *
FROM (
    SELECT A.surr_user_id, A.user_id, A.name, A.cty, A.st, A.version,
           A.ver_start_date,
           CASE WHEN B.user_id IS NOT NULL AND A.ver_end_date = ${hivevar:max_date}
                THEN ${hivevar:pre_date}
                ELSE A.ver_end_date
           END AS ver_end_date
    FROM dim_user AS A
    LEFT JOIN ods_user_update AS B ON A.user_id = B.user_id
    UNION
    SELECT ROW_NUMBER() OVER (ORDER BY C.user_id) + D.sk_max,
           C.user_id, C.name, C.cty, C.st, 0,
           ${hivevar:pre_date} AS ver_start_date,
           ${hivevar:max_date} AS ver_end_date
    FROM ods_user_update AS C
    CROSS JOIN (SELECT COALESCE(MAX(surr_user_id), 0) sk_max FROM dim_user) D
) AS T;

hive> select * from dim_user;
OK
1	1	pxj	wx	0	0	2019-11-05	2019-11-05
2	2	pl	gs	0	0	2019-11-05	2019-11-05
3	3	py	dz	0	0	2019-11-05	2200-01-01
4	4	hw	wx	0	0	2019-11-05	2200-01-01
5	5	yb	yc	0	0	2019-11-05	2200-01-01
6	6	lc	wx	0	0	2019-11-05	2200-01-01
8	1	pxj	wx	2	0	2019-11-05	2200-01-01
9	2	pl	gs	2	0	2019-11-05	2200-01-01
10	10	aa	bb	0	0	2019-11-05	2200-01-01
11	12	cc	cc	0	0	2019-11-05	2200-01-01
12	13	dd	dd	0	0	2019-11-05	2200-01-01

# Update the order fact table (again the same statement as on day 1)
insert into fact_order(surr_user_id,user_id,order_id,price)
select b.surr_user_id, b.user_id, a.order_id, a.price
from ods_order_daily a
join dim_user b
  on a.user_id = b.user_id and b.ver_end_date = '2200-01-01';

hive> select * from fact_order;
OK
1	1	1	100
2	2	2	200
3	3	3	300
3	3	7	500
5	5	6	223
6	6	5	232
8	1	4	320
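One piece the walkthrough never shows is the DDL for dim_user itself. Based on the columns the merge selects and the query output above, a minimal definition might look like the following sketch (column types, comment, and storage format are assumptions, not taken from the original):

```sql
-- Assumed DDL for the SCD Type 2 user dimension; adjust types and
-- storage format to match your environment.
create table if not exists dim_user (
    surr_user_id   int,     -- surrogate key, unique per row version
    user_id        int,     -- business key from the source system
    name           string,
    cty            string,
    st             string,
    version        int,
    ver_start_date date,    -- first day this version is effective
    ver_end_date   date     -- '2200-01-01' marks the current version
)
comment 'user dimension, SCD type 2'
row format delimited fields terminated by ','
lines terminated by '\n';
```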
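With the day-2 state in place, the history of any single business key can be read straight off the validity dates. For user 1, for example, a query along these lines returns both the expired row (st = 0, closed on 2019-11-05) and the current one (st = 2, open until 2200-01-01):

```sql
-- Change history of one user: one row per version, oldest first.
select surr_user_id, name, cty, st, ver_start_date, ver_end_date
from dim_user
where user_id = 1
order by ver_start_date, surr_user_id;
```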
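The reason the fact table stores the surrogate key is that each order stays tied to the attribute values that were current when it was loaded. Joining on surr_user_id rather than user_id therefore reproduces history correctly: day-1 order 1 resolves to user 1's old row with st = 0, while day-2 order 4 resolves to the new row with st = 2. A sketch:

```sql
-- Point-in-time user attributes per order, via the surrogate key.
select f.order_id, f.price, d.user_id, d.name, d.st
from fact_order f
join dim_user d on f.surr_user_id = d.surr_user_id
order by f.order_id;
```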
Hive Data Warehouse: A Worked Example of Fact Tables and Slowly Changing Dimensions (SCD)