需求
有如下数据
id | date | val |
---|---|---|
101 | 2022-12-01 | 10 |
101 | 2022-12-03 | 20 |
101 | 2022-12-06 | 40 |
102 | 2022-12-02 | 20 |
102 | 2022-12-04 | 30 |
102 | 2022-12-07 | 60 |
期望构建如下数据:
id | final_date | val | final_value |
---|---|---|---|
101 | 2022-12-01 | 10 | 10 |
101 | 2022-12-02 | 0 | 10 |
101 | 2022-12-03 | 20 | 30 |
101 | 2022-12-04 | 0 | 30 |
101 | 2022-12-05 | 0 | 30 |
101 | 2022-12-06 | 40 | 70 |
101 | 2022-12-07 | 0 | 70 |
102 | 2022-12-01 | 0 | 0 |
102 | 2022-12-02 | 20 | 20 |
102 | 2022-12-03 | 0 | 20 |
102 | 2022-12-04 | 30 | 50 |
102 | 2022-12-05 | 0 | 50 |
102 | 2022-12-06 | 0 | 50 |
102 | 2022-12-07 | 60 | 110 |
期望数据是将已有数据补齐2022-12-01到2022-12-07号日期,没有的日期value为0,然后计算每组id内的累加value值。
分析
从期望数据角度看,咱们如果有每组id的2022-12-01到2022-12-07的数据,作为主表left join已有数据,再一个sum()over() 就很简单了,关键在于如何构造连续日期。
构造连续日期,且分成多行很容易想到利用repeat()函数,和posexplode()拆分重复的字符串,构建多行带索引的数据。再利用行索引与2022-12-01做日期计算,就能得到每组组内的01-07号多行数据
实现
1、先构建所有分组的2022-12-01、2022-12-07
with tmp as (
select '101' as id, '2022-12-01' as idate, 10 as value
union all
select '101' as id, '2022-12-03' as idate, 20 as value
union all
select '101' as id, '2022-12-06' as idate, 40 as value
union all
select '102' as id, '2022-12-02' as idate, 20 as value
union all
select '102' as id, '2022-12-04' as idate, 30 as value
union all
select '102' as id, '2022-12-07' as idate, 60 as value
)
select id,'2022-12-07' as end_time,'2022-12-01' as start_time from tmp group by id
2、再将步骤1中的2022-12-07与2022-12-01做差+1,作为repeat()函数的参数,重复多少次。得到重复字符串后,利用posexplode()列转行,将重复字符串拆分出有索引(ind)的多行,再将start_time与索引做日期计算,得到最终主表数据结果,见sql:
with tmp as (
select '101' as id, '2022-12-01' as idate, 10 as value
union all
select '101' as id, '2022-12-03' as idate, 20 as value
union all
select '101' as id, '2022-12-06' as idate, 40 as value
union all
select '102' as id, '2022-12-02' as idate, 20 as value
union all
select '102' as id, '2022-12-04' as idate, 30 as value
union all
select '102' as id, '2022-12-07' as idate, 60 as value
),
during_date as (
select id,'2022-12-07' as end_time,'2022-12-01' as start_time from tmp group by id
)
select
a.id,a.start_time,b.ind
,substr(dateadd(to_date(a.start_time,'yyyy-mm-dd'),b.ind,'dd'),1,10) as final_date
from during_date a
lateral view posexplode( split(repeat('a,',datediff(end_time,start_time)+1),',') ) b as ind,val
结果如下:
3、步骤2中的数据,咱们只需要id和final_date列即可,作为主表数据 去left join基础数据,见sql:
with tmp as (
select '101' as id, '2022-12-01' as idate, 10 as value
union all
select '101' as id, '2022-12-03' as idate, 20 as value
union all
select '101' as id, '2022-12-06' as idate, 40 as value
union all
select '102' as id, '2022-12-02' as idate, 20 as value
union all
select '102' as id, '2022-12-04' as idate, 30 as value
union all
select '102' as id, '2022-12-07' as idate, 60 as value
),
during_date as (
select id,'2022-12-07' as end_time,'2022-12-01' as start_time from tmp group by id
)
select
a.id,
a.final_date,
nvl(b.value,0) as value,
sum(nvl(b.value,0)) over(partition by a.id order by a.final_date ) as final_value
from
(select
a.id
,substr(dateadd(to_date(a.start_time,'yyyy-mm-dd'),b.ind,'dd'),1,10) as final_date
from during_date a
lateral view posexplode( split(repeat('a,',datediff(end_time,start_time)+1),',') ) b as ind,val
) a
left join
tmp b
on a.id = b.id and a.final_date = b.idate
结果如下:
最后
喜欢的点赞、关注、收藏吧~ 你的支持是最大的创作动力~~