原始数据
id stt edt
1001 2022-05-03 2022-05-08
1001 2022-05-10 2022-05-15
1002 2022-05-03 2022-05-10
1002 2022-05-08 2022-05-15
1003 2022-05-03 2022-05-20
1003 2022-05-08 2022-05-15
1003 2022-05-17 2022-05-25
1004 2022-05-03 2022-05-20
1004 2022-05-08 2022-05-17
1004 2022-05-15 2022-05-25
思路一:不考虑时间交叉问题
1.1 计算每条数据的天数(日期相减少一天)
select
id,
(datediff(edt,stt)+1) days
from test3;t1
运行结果:
+-------+-------+
| id | days |
+-------+-------+
| 1001 | 6 |
| 1001 | 6 |
| 1002 | 8 |
| 1002 | 8 |
| 1003 | 18 |
| 1003 | 8 |
| 1003 | 9 |
| 1004 | 18 |
| 1004 | 10 |
| 1004 | 11 |
+-------+-------+
1.2按照品牌分组,计算总天数
select
id,
sum(days) days
from
(
select
id,
(datediff(edt,stt)+1) days
from test3
)t1
group by id;
查询结果: 显然不对,一个月内活动天数不会超过30天,考虑到数据可能有时间交叉问题!
+-------+-------+
| id | days |
+-------+-------+
| 1004 | 39 |
| 1002 | 16 |
| 1003 | 35 |
| 1001 | 12 |
+-------+-------+
思路二:考虑时间交叉问题(第一次的活动结束时间大于第二次活动的开始时间),将上次活动结束时间下移,若是大于第二次活动的开始时间,则将第二次活动的开始替换为下移时间+1天
2.1 将上一行数据下移
select
id,
stt,
edt,
lag(edt,1,'1970-01-01') over(partition by id order by stt) lag_dt
from test3;
查询结果:
+-------+-------------+-------------+-------------+
| id | stt | edt | lag_dt |
+-------+-------------+-------------+-------------+
| 1001 | 2022-05-03 | 2022-05-08 | 1970-01-01 |
| 1001 | 2022-05-10 | 2022-05-15 | 2022-05-08 |
| 1002 | 2022-05-03 | 2022-05-10 | 1970-01-01 |
| 1002 | 2022-05-08 | 2022-05-15 | 2022-05-10 |
| 1003 | 2022-05-03 | 2022-05-20 | 1970-01-01 |
| 1003 | 2022-05-08 | 2022-05-15 | 2022-05-20 |
| 1003 | 2022-05-17 | 2022-05-25 | 2022-05-15 |
| 1004 | 2022-05-03 | 2022-05-20 | 1970-01-01 |
| 1004 | 2022-05-08 | 2022-05-17 | 2022-05-20 |
| 1004 | 2022-05-15 | 2022-05-25 | 2022-05-17 |
+-------+-------------+-------------+-------------+
2.2比较当前行stt时间与lag_dt的大小,若是当前行大,则不变,若是lag_dt大,则用lag_dt+1替换当前行的stt
select
id,
if(stt>lag_dt,stt,date_add(lag_dt,1)) stt,
edt
from
(
select
id,
stt,
edt,
lag(edt,1,'1970-01-01') over(partition by id order by stt) lag_dt
from test3
)t1;
运行结果:
+-------+-------------+-------------+
| id | stt | edt |
+-------+-------------+-------------+
| 1001 | 2022-05-03 | 2022-05-08 |
| 1001 | 2022-05-10 | 2022-05-15 |
| 1002 | 2022-05-03 | 2022-05-10 |
| 1002 | 2022-05-11 | 2022-05-15 |
| 1003 | 2022-05-03 | 2022-05-20 |
| 1003 | 2022-05-21 | 2022-05-15 |
| 1003 | 2022-05-17 | 2022-05-25 |
| 1004 | 2022-05-03 | 2022-05-20 |
| 1004 | 2022-05-21 | 2022-05-17 |
| 1004 | 2022-05-18 | 2022-05-25 |
+-------+-------------+-------------+
2.3 计算每条数据的天数 按品牌分组,计算总天数
select
id,
sum(days) days
from
(
select
id,
(datediff(edt,stt)+1) days
from
(
select
id,
if(stt>lag_dt,stt,date_add(lag_dt,1)) stt,
edt
from
(
select
id,
stt,
edt,
lag(edt,1,'1970-01-01') over(partition by id order by stt) lag_dt
from test3
)t1
)t2
)t3
group by id;
运行结果: 只能解决两行的时间交叉问题,解决不了多行间的时间交叉问题。
+-------+-------+
| id | days |
+-------+-------+
| 1004 | 23 |
| 1002 | 13 |
| 1003 | 22 |
| 1001 | 12 |
+-------+-------+
思路三:考虑到时间跨行(多行)交叉问题
3.1 计算上无边界到当前行上一行的edt最大值(也就是截止前一天所有的活动中结束时间最大的那个)
select
id,
stt,
edt,
max(edt) over(partition by id order by stt rows between unbounded preceding and 1 preceding) max_edt
from test3;t1
查询结果:
+-------+-------------+-------------+-------------+
| id | stt | edt | max_edt |
+-------+-------------+-------------+-------------+
| 1001 | 2022-05-03 | 2022-05-08 | NULL |
| 1001 | 2022-05-10 | 2022-05-15 | 2022-05-08 |
| 1002 | 2022-05-03 | 2022-05-10 | NULL |
| 1002 | 2022-05-08 | 2022-05-15 | 2022-05-10 |
| 1003 | 2022-05-03 | 2022-05-20 | NULL |
| 1003 | 2022-05-08 | 2022-05-15 | 2022-05-20 |
| 1003 | 2022-05-17 | 2022-05-25 | 2022-05-20 |
| 1004 | 2022-05-03 | 2022-05-20 | NULL |
| 1004 | 2022-05-08 | 2022-05-17 | 2022-05-20 |
| 1004 | 2022-05-15 | 2022-05-25 | 2022-05-20 |
+-------+-------------+-------------+-------------+
3.2 比较当前行stt和前置日期(max_edt)最大值,若当前行大则不变,反之则用前置日期+1天替之
select
id,
if(max_edt is not null and max_edt>stt,date_add(max_edt,1),stt) stt,
edt
from
(
select
id,
stt,
edt,
max(edt) over(partition by id order by stt rows between unbounded preceding and 1 preceding) max_edt
from test3
)t1;
运行结果:已不存在多行交叉时间
+-------+-------------+-------------+
| id | stt | edt |
+-------+-------------+-------------+
| 1001 | 2022-05-03 | 2022-05-08 |
| 1001 | 2022-05-10 | 2022-05-15 |
| 1002 | 2022-05-03 | 2022-05-10 |
| 1002 | 2022-05-11 | 2022-05-15 |
| 1003 | 2022-05-03 | 2022-05-20 |
| 1003 | 2022-05-21 | 2022-05-15 |
| 1003 | 2022-05-21 | 2022-05-25 |
| 1004 | 2022-05-03 | 2022-05-20 |
| 1004 | 2022-05-21 | 2022-05-17 |
| 1004 | 2022-05-21 | 2022-05-25 |
+-------+-------------+-------------+
3.3 计算每条数据天数(出现负数,说明这条日期的时间被结束时间最大的那条数据完全包围)
select
id,
(datediff(edt,stt)+1) days
from
(
select
id,
if(max_edt is not null and max_edt>stt,date_add(max_edt,1),stt) stt,
edt
from
(
select
id,
stt,
edt,
max(edt) over(partition by id order by stt rows between unbounded preceding and 1 preceding) max_edt
from test3
)t1
)t2;
运行结果:负数取0即可
+-------+-------+
| id | days |
+-------+-------+
| 1001 | 6 |
| 1001 | 6 |
| 1002 | 8 |
| 1002 | 5 |
| 1003 | 18 |
| 1003 | -5 |
| 1003 | 5 |
| 1004 | 18 |
| 1004 | -3 |
| 1004 | 5 |
+-------+-------+
3.4按照品牌分组,就散总天数
select
id,
sum(if(days>0,days,0)) days
from
(
select
id,
(datediff(edt,stt)+1) days
from
(
select
id,
if(max_edt is not null and max_edt>stt,date_add(max_edt,1),stt) stt,
edt
from
(
select
id,
stt,
edt,
max(edt) over(partition by id order by stt rows between unbounded preceding and 1 preceding) max_edt
from test3
)t1
)t2
)t3
group by id;
最终结果:
+-------+-------+
| id | days |
+-------+-------+
| 1004 | 23 |
| 1002 | 13 |
| 1003 | 23 |
| 1001 | 12 |
+-------+-------+