hive上一种通用的拉链记历史方法

最新推荐文章于 2024-07-02 08:35:02 发布

筱Mary

最新推荐文章于 2024-07-02 08:35:02 发布

阅读量1.8k

点赞数 1

分类专栏： Hive 文章标签： hive 数据仓库拉链

Hive 专栏收录该内容

16 篇文章 0 订阅

订阅专栏

数据仓库中针对历史数据的记录方法一般有3种方法
1.保存最新记录，用最新数据计算历史数据
2.快照，针对每天保留全量数据
3.拉链记历史，每天针对变化的数据记录其生命周期
上面3种各有优势，主要体现的存储空间占用、下游任务使用成本、是否能回溯历史，整体上是不同应用场景下要有不同选择。
针对拉链记历史一种通用的方法是对每条数据记录下生命周期begin_date,end_date;这咱在日常工作中最常用。在这种情况下
就要考虑两个点。一是数据由于同一条记录在整个生命周期有可能存在多条，如何存储能保证下游的使用方便和利用成本最低；
二是当这种需求多时，如何利用统一的代码来把这种需求归一；不是重复性的代码。
针对上面这两点考虑，我们在生产环境中设计思路如下
1.采用生命周期来分区，这样能有效利用数据库的分区裁剪功能，提高下游利用效率；这是就要考虑分区字段的选择，一般情况
是按天（end_date,begin_date）这样二级分区，这样利用效率最高。这里注意一定要把end_date放前面，原因是你使用时，一般
是取某一天的记录，就是end_date>某天。这里考虑到按天分区，虽然效率高，但是造成子分区过多，整个的数据碎片化严重，而且
在ODPS上导致的子目录过多，这里分区字段归化到月
2.考虑代码通用性，这里把源表名、delta表名、日期作为参数，然后读取元数据，来拼接SQL；再调用接口执行。保持了代码的封装。

一个ODPS的事例代码.注意顺序不可改
1.取有变更的记录，把end_date 换成当天,分区字段归一到当月

insert into table wcq_lc_wjs_claim_reserve_dev partition(pt1_end_month,pt2_begin_month)
select
 a.id                    
,a.reserve_type          
,a.outstanding_amount    
,a.settled_amount        
,a.remark                
,a.operator              
,a.op_date               
,a.claim_id              
,a.policy_id             
,a.is_deleted            
,a.gmt_created           
,a.creator               
,a.gmt_modified          
,a.modifier              
,a.begin_date         
,a.end_date
,TO_CHAR(DATETRUNC(DATEADD(TO_DATE('${bizdate}','yyyymmdd'),-1,'dd'),'MM'),'yyyymmdd000000')  pt1_end_month
,TO_CHAR(DATETRUNC(a.begin_date,'MM'),'yyyymmdd000000')          pt2_begin_month
from (
select 
 ,b.id                         
 ,b.reserve_type                 
 ,b.outstanding_amount           
 ,b.settled_amount               
 ,b.remark                       
 ,b.operator                     
 ,b.op_date                      
 ,b.claim_id                     
 ,b.policy_id                    
 ,b.is_deleted                   
 ,b.gmt_created                  
 ,b.creator                      
 ,b.gmt_modified                 
 ,b.modifier                     
 ,b.begin_date                   
 ,DATEADD(TO_DATE('${bizdate}','yyyymmdd'),-1,'dd')  end_date                     
from (select * from wcq_ods_wjs_claim_reserve_delta where pt='${bizdate}000000') a 
join (select * from wcq_lc_wjs_claim_reserve where pt1_end_month='30001201000000' ) b on (a.id=b.id)
)a
left outer join  (select * from wcq_lc_wjs_claim_reserve where pt1_end_month=TO_CHAR(DATETRUNC(DATEADD(TO_DATE('${bizdate}','yyyymmdd'),-1,'dd'),'MM'),'yyyymmdd000000')) b on (a.id=b.id)
where b.id is null;

2.把新增、变化的数据begin_date，改成当天。别的不改

insert overwrite table wcq_lc_wjs_claim_reserve_dev partition(pt1_end_month='30001201000000',pt2_begin_month='19000101000000')
select 
  case when  a.id  is not null then a.id                 else b.id                 end as id                            
 ,case when  a.id  is not null then a.reserve_type       else b.reserve_type       end as reserve_type                  
 ,case when  a.id  is not null then a.outstanding_amount else b.outstanding_amount end as outstanding_amount            
 ,case when  a.id  is not null then a.settled_amount     else b.settled_amount     end as settled_amount                
 ,case when  a.id  is not null then a.remark             else b.remark             end as remark                        
 ,case when  a.id  is not null then a.operator           else b.operator           end as operator                      
 ,case when  a.id  is not null then a.op_date            else b.op_date            end as op_date                     
 ,case when  a.id  is not null then a.claim_id           else b.claim_id           end as claim_id                      
 ,case when  a.id  is not null then a.policy_id          else b.policy_id          end as policy_id                     
 ,case when  a.id  is not null then a.is_deleted         else b.is_deleted         end as is_deleted                    
 ,case when  a.id  is not null then a.gmt_created        else b.gmt_created        end as gmt_created                 
 ,case when  a.id  is not null then a.creator            else b.creator            end as creator                       
 ,case when  a.id  is not null then a.gmt_modified       else b.gmt_modified       end as gmt_modified                
 ,case when  a.id  is not null then a.modifier           else b.modifier           end as modifier 
 ,case when  a.id  is not null then to_date('${bizdate}','yyyymmdd')  else b.begin_date    end as begin_date  
 ,case when  a.id  is not null then to_date('30001231','yyyymmdd')    else b.end_date      end as end_date

from (select * from wcq_ods_wjs_claim_reserve_delta where pt='${bizdate}000000') a 
full outer join (select * from wcq_lc_wjs_claim_reserve where pt1_end_month='30001201000000' and end_date>='${bizdate}' and begin_date<='${bizdate}') b on (a.id=b.id)
;

最后，如何使用

select * from wcq_lc_wjs_claim_reserve_dev 
   where pt1_end_month>=to_char(to_date('${bizdate}','yyyymmdd'),'yyyymm01000000') 
     and pt2_begin_month<=to_char(to_date('${bizdate}','yyyymmdd'),'yyyymm01000000') 
     and begin_date<=to_date('${bizdate}','yyyymmdd')
     and end_date>=to_date('${bizdate}','yyyymmdd')
     ;

筱Mary

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
hive上一种通用的拉链记历史方法

数据仓库中针对历史数据的记录方法一般有3种方法 1.保存最新记录，用最新数据计算历史数据 2.快照，针对每天保留全量数据 3.拉链记历史，每天针对变化的数据记录其生命周期上面3种各有优势，主要体现的存储空间占用、下游任务使用成本、是否能回溯历史，整体上是不同应用场景下要有不同选择。针对拉链记历史一种通用的方法是对每条数据记录下生命周期begin_date,end_date;这咱
复制链接

扫一扫

专栏目录