虽然已经在知乎上面提了问题,但是没有一个人回答,目前我也把结果做了出来了。
还是需要展示一下.
第一步,将半年的数据分成单月取初始化,如下只是201801的数据,其他就不展示了,改下日期就可以了:
--HIVE_CONFIG=set hive.exec.parallel=true;set mapred.max.split.size=50000000;set mapred.min.split.size.per.node=50000000;set mapred.min.split.size.per.rack=50000000;set hive.exec.reducers.bytes.per.reducer=50000000;set hive.hadoop.supports.splittable.combineinputformat=true;set hive.vectorized.execution.enabled = true;set hive.vectorized.execution.reduce.enabled = true;set mapreduce.map.memory.mb=5120;set mapreduce.reduce.memory.mb=5120;
with tmp1 as (
select cust_id,zc_bal,cast(dt as string ) as dt
,lead(zc_bal,1,NULL)over(partition by cust_id order by dt) as zc_bal1
from FACT_RPSM.F_CUST_BAL_SUM a
where dt>='20180101' and dt <='20180131' and coalesce(cust_id,'')<>''
),
tmp2 as (
SELECT A.cust_id,A.DT,A.zc_bal,zc_bal1,(CASE WHEN zc_bal=zc_bal1 THEN '29991231' ELSE DT END ) DT_END FROM tmp1 a
),
tmp3 as (
SELECT A.cust_id,A.DT,A.zc_bal,zc_bal1,DT_END,cast(cast( lag(dt,1)over(partition by cust_id order by dt)+'1' as int) as string) as DATE_2 from tmp2 a WHERE DT_END<>'29991231'
),
tmp4 as (
SELECT cust_id,zc_bal,coalesce(DATE_2, DT) AS START_DATE,DT_END AS END_DATE FROM tmp3 a
)
SELECT a.cust_id,zc_bal,START_DATE
,CASE WHEN max(END_DATE)over(partition by cust_id)=END_DATE and END_DATE='20180131' THEN '29991231' ELSE END_DATE END END_DATE
,CASE WHEN max(END_DATE)over(partition by cust_id)=END_DATE and END_DATE<'20180131' THEN '1' ELSE '0' END IS_FLAG
FROM tmp4 a
hive基本四则运算会将字符转化为decimal类型,日期相加后要先转int类型再转string类型
第二步,拼接。有sql实现拼接,但是因为数据倾斜或是其他原因,跑不出来,就只能把每月末的数据闭链,再全部整合:
--HIVE_CONFIG=set hive.exec.parallel=true;set mapred.max.split.size=50000000;set mapred.min.split.size.per.node=50000000;set mapred.min.split.size.per.rack=50000000;set hive.exec.reducers.bytes.per.reducer=50000000;set