HQL:求出每个商品的打折天数
问题描述:
每个商品可能有多个打折活动,而打折活动的时间段可能会有重叠,求出商品实际参与打折的天数。
商品打折数据:
id stt edt
1 2021-06-01 2021-06-05
1 2021-06-07 2021-06-09
2 2021-06-01 2021-06-10
2 2021-06-05 2021-06-15
3 2021-06-01 2021-06-21
3 2021-06-05 2021-06-17
3 2021-06-15 2021-06-27
4 2021-06-01 2021-06-21
4 2021-06-01 2021-06-05
4 2021-06-07 2021-06-15
5 2021-06-01 2021-06-21
5 2021-06-05 2021-06-10
5 2021-06-08 2021-06-15
建表语句:
create table discount
(
id string,
stt string,
edt string
)
row format delimited fields terminated by '\t';
分析:
1)这里因为有时间的存在,所以我将商品的打折开始时间和结束时间合并到一列中当作流式数据处理,并且给开始时间打上开始标记,给结束时间打上结束标记,最终按照id和开始时间(stt)排序。
查询结果:
1,2021-06-01,start
1,2021-06-05,stop
1,2021-06-07,start
1,2021-06-09,stop
2,2021-06-01,start
2,2021-06-05,start
2,2021-06-10,stop
2,2021-06-15,stop
3,2021-06-01,start
3,2021-06-05,start
3,2021-06-15,start
3,2021-06-17,stop
3,2021-06-21,stop
3,2021-06-27,stop
4,2021-06-01,start
4,2021-06-01,start
4,2021-06-05,stop
4,2021-06-07,start
4,2021-06-15,stop
4,2021-06-21,stop
5,2021-06-01,start
5,2021-06-05,start
5,2021-06-08,start
5,2021-06-10,stop
5,2021-06-15,stop
5,2021-06-21,stop
子查询:
select id,
stt event_time,
'start' event_type
from discount
union all
select id,
edt event_time,
'stop' event_type
from discount
order by id, event_time;
2)根据标记进行对某个时间段内的活动个数进行统计
查询结果:
1,2021-06-01,1
1,2021-06-05,0
1,2021-06-07,1
1,2021-06-09,0
3,2021-06-01,1
3,2021-06-05,2
3,2021-06-15,3
3,2021-06-17,2
3,2021-06-21,1
3,2021-06-27,0
2,2021-06-01,1
2,2021-06-05,2
2,2021-06-10,1
2,2021-06-15,0
4,2021-06-01,2
4,2021-06-01,2
4,2021-06-05,1
4,2021-06-07,2
4,2021-06-15,1
4,2021-06-21,0
5,2021-06-01,1
5,2021-06-05,2
5,2021-06-08,3
5,2021-06-10,2
5,2021-06-15,1
5,2021-06-21,0
子查询:
select id,
event_time,
sum(`if`(event_type = 'start', 1, -1)) over (partition by id order by event_time) event_ct
from (
select id,
stt event_time,
'start' event_type
from discount
union all
select id,
edt event_time,
'stop' event_type
from discount
order by id, event_time
) t1;
3)如果event_ct为0,则那条数据对应的event_time就是活动结束时间,统计出结束时间
查询结果:
1,2021-06-01,1,
1,2021-06-05,0,2021-06-05
1,2021-06-07,1,
1,2021-06-09,0,2021-06-09
3,2021-06-01,1,
3,2021-06-05,2,
3,2021-06-15,3,
3,2021-06-17,2,
3,2021-06-21,1,
3,2021-06-27,0,2021-06-27
2,2021-06-01,1,
2,2021-06-05,2,
2,2021-06-10,1,
2,2021-06-15,0,2021-06-15
4,2021-06-01,2,
4,2021-06-01,2,
4,2021-06-05,1,
4,2021-06-07,2,
4,2021-06-15,1,
4,2021-06-21,0,2021-06-21
5,2021-06-01,1,
5,2021-06-05,2,
5,2021-06-08,3,
5,2021-06-10,2,
5,2021-06-15,1,
5,2021-06-21,0,2021-06-21
子查询:
select id,
event_time,
event_ct,
`if`(event_ct = 0, event_time, null) endtime
from (
select id,
event_time,
sum(`if`(event_type = 'start', 1, -1)) over (partition by id order by event_time) event_ct
from (
select id,
stt event_time,
'start' event_type
from discount
union all
select id,
edt event_time,
'stop' event_type
from discount
order by id, event_time
) t1
) t2;
4)给每一条数据打上一个标记,用来表示该条数据的上一个时间段有几条活动存在
查询结果:
1,2021-06-01,1,0,
1,2021-06-05,0,1,2021-06-05
1,2021-06-07,1,0,
1,2021-06-09,0,1,2021-06-09
3,2021-06-01,1,0,
3,2021-06-05,2,1,
3,2021-06-15,3,2,
3,2021-06-17,2,3,
3,2021-06-21,1,2,
3,2021-06-27,0,1,2021-06-27
2,2021-06-01,1,0,
2,2021-06-05,2,1,
2,2021-06-10,1,2,
2,2021-06-15,0,1,2021-06-15
4,2021-06-01,2,0,
4,2021-06-01,2,2,
4,2021-06-05,1,2,
4,2021-06-07,2,1,
4,2021-06-15,1,2,
4,2021-06-21,0,1,2021-06-21
5,2021-06-01,1,0,
5,2021-06-05,2,1,
5,2021-06-08,3,2,
5,2021-06-10,2,3,
5,2021-06-15,1,2,
5,2021-06-21,0,1,2021-06-21
子查询:
select id,
event_time,
event_ct,
lag(event_ct,1,0) over (partition by id order by event_time) lag_event_ct,
endtime
from (
select id,
event_time,
event_ct,
`if`(event_ct = 0, event_time, null) endtime
from (
select id,
event_time,
sum(`if`(event_type = 'start', 1, -1)) over (partition by id order by event_time) event_ct
from (
select id,
stt event_time,
'start' event_type
from discount
union all
select id,
edt event_time,
'stop' event_type
from discount
order by id, event_time
) t1
) t2
) t3;
5)如果字段lag_event_ct为0,证明该条数据之前所有的打折活动未开始或者已结束,即该条数据对应的event_time是打折活动开始时间,统计出每个商品打折活动开始时间和结束时间
查询结果:
(空白部分为null)
1,2021-06-01,
1,,2021-06-05
1,2021-06-07,
1,,2021-06-09
3,2021-06-01,
3,,
3,,
3,,
3,,
3,,2021-06-27
2,2021-06-01,
2,,
2,,
2,,2021-06-15
4,2021-06-01,
4,,
4,,
4,,
4,,
4,,2021-06-21
5,2021-06-01,
5,,
5,,
5,,
5,,
5,,2021-06-21
子查询:
select id,
if(lag_event_ct = 0, event_time, null) start_time,
endtime
from (
select id,
event_time,
event_ct,
lag(event_ct, 1, 0) over (partition by id order by event_time) lag_event_ct,
endtime
from (
select id,
event_time,
event_ct,
`if`(event_ct = 0, event_time, null) endtime
from (
select id,
event_time,
sum(`if`(event_type = 'start', 1, -1))
over (partition by id order by event_time) event_ct
from (
select id,
stt event_time,
'start' event_type
from discount
union all
select id,
edt event_time,
'stop' event_type
from discount
order by id, event_time
) t1
) t2
) t3
) t4;
6)将开始时间和结束时间合并到一行
查询结果:
3,2021-06-01,2021-06-27
3,,
5,2021-06-01,2021-06-21
5,,
1,2021-06-01,2021-06-05
1,,
1,2021-06-07,2021-06-09
1,,
2,2021-06-01,2021-06-15
2,,
4,2021-06-01,2021-06-21
4,,
子查询:
select id,
start_time,
lead(endtime) over (partition by id) end_time
from (
select id,
if(lag_event_ct = 0, event_time, null) start_time,
endtime
from (
select id,
event_time,
event_ct,
lag(event_ct, 1, 0) over (partition by id order by event_time) lag_event_ct,
endtime
from (
select id,
event_time,
event_ct,
`if`(event_ct = 0, event_time, null) endtime
from (
select id,
event_time,
sum(`if`(event_type = 'start', 1, -1))
over (partition by id order by event_time) event_ct
from (
select id,
stt event_time,
'start' event_type
from discount
union all
select id,
edt event_time,
'stop' event_type
from discount
order by id, event_time
) t1
) t2
) t3
) t4
) t5
where start_time is not null or endtime is not null;
7)过滤掉空数据
查询结果:
3,2021-06-01,2021-06-27
5,2021-06-01,2021-06-21
1,2021-06-01,2021-06-05
1,2021-06-07,2021-06-09
2,2021-06-01,2021-06-15
4,2021-06-01,2021-06-21
子查询:
select id,
start_time,
end_time
from (
select id,
start_time,
lead(endtime) over (partition by id) end_time
from (
select id,
if(lag_event_ct = 0, event_time, null) start_time,
endtime
from (
select id,
event_time,
event_ct,
lag(event_ct, 1, 0) over (partition by id order by event_time) lag_event_ct,
endtime
from (
select id,
event_time,
event_ct,
`if`(event_ct = 0, event_time, null) endtime
from (
select id,
event_time,
sum(`if`(event_type = 'start', 1, -1))
over (partition by id order by event_time) event_ct
from (
select id,
stt event_time,
'start' event_type
from discount
union all
select id,
edt event_time,
'stop' event_type
from discount
order by id, event_time
) t1
) t2
) t3
) t4
) t5
where start_time is not null
or endtime is not null
) t6
where start_time is not null;
8)按照id,开始时间,结束时间进行分组,求得每个id,每个打折活动持续时间
查询结果:
1,5
5,21
2,15
4,21
3,27
1,3
子查询:
select id,
(datediff(end_time, start_time) + 1) days
from (
select id,
start_time,
end_time
from (
select id,
start_time,
lead(endtime) over (partition by id) end_time
from (
select id,
if(lag_event_ct = 0, event_time, null) start_time,
endtime
from (
select id,
event_time,
event_ct,
lag(event_ct, 1, 0) over (partition by id order by event_time) lag_event_ct,
endtime
from (
select id,
event_time,
event_ct,
`if`(event_ct = 0, event_time, null) endtime
from (
select id,
event_time,
sum(`if`(event_type = 'start', 1, -1))
over (partition by id order by event_time) event_ct
from (
select id,
stt event_time,
'start' event_type
from discount
union all
select id,
edt event_time,
'stop' event_type
from discount
order by id, event_time
) t1
) t2
) t3
) t4
) t5
where start_time is not null
or endtime is not null
) t6
where start_time is not null
) t7
group by id, start_time, end_time;
9)将每个id的每个打折活动的持续时间相加(到这里已经没有重叠的天数了)
查询结果:
1,8
2,15
3,27
4,21
5,21
最终SQL:
select id,
sum(days) days
from (
select id,
(datediff(end_time, start_time) + 1) days
from (
select id,
start_time,
end_time
from (
select id,
start_time,
lead(endtime) over (partition by id) end_time
from (
select id,
if(lag_event_ct = 0, event_time, null) start_time,
endtime
from (
select id,
event_time,
event_ct,
lag(event_ct, 1, 0) over (partition by id order by event_time) lag_event_ct,
endtime
from (
select id,
event_time,
event_ct,
`if`(event_ct = 0, event_time, null) endtime
from (
select id,
event_time,
sum(`if`(event_type = 'start', 1, -1))
over (partition by id order by event_time) event_ct
from (
select id,
stt event_time,
'start' event_type
from discount
union all
select id,
edt event_time,
'stop' event_type
from discount
order by id, event_time
) t1
) t2
) t3
) t4
) t5
where start_time is not null
or endtime is not null
) t6
where start_time is not null
) t7
group by id, start_time, end_time
) t8
group by id
order by id;