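Scenario
For an app, we naturally want users to use it as often as possible. The more frequently a user opens the app, the more active that user is. For the operations team, highly active users need little extra effort to retain; it is the moderately active users, and those who only log in occasionally, who take real effort to keep. That makes activity level a very important metric when planning campaigns.
For a courier, one way to judge how hard he works is the number of consecutive days he showed up within a week or a month.
For a supermarket, the worry is that a customer comes to buy something and the shelf is empty. A large supermarket has many items to restock, so the maximum number of consecutive out-of-stock days can be used to rank how urgent each restock is. An out-of-stock item does not necessarily have demand, but here we assume every item is bought every day.
All three scenarios reduce to the same problem: counting how many times in a row an event occurs within a given period.
To simplify the problem, find the start and end date of each holiday run in the data below:
Date | Is holiday |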
---|---|
2014-01-01 | 1 |
2014-01-02 | 0 |
2014-01-03 | 0 |
2014-01-04 | 1 |
2014-01-05 | 1 |
2014-01-06 | 0 |
2014-01-07 | 0 |
2014-01-08 | 1 |
2014-01-09 | 0 |
2014-01-10 | 0 |
2014-01-11 | 1 |
2014-01-12 | 1 |
2014-01-13 | 0 |
2014-01-14 | 0 |
2014-01-15 | 0 |
2014-01-16 | 0 |
2014-01-17 | 0 |
2014-01-18 | 1 |
2014-01-19 | 1 |
2014-01-20 | 1 |
The expected result:
Date | Is holiday | Start | End |
---|---|---|---|
2014-01-01 | 1 | 2014-01-01 | 2014-01-01 |
2014-01-02 | 0 | ||
2014-01-03 | 0 | ||
2014-01-04 | 1 | 2014-01-04 | 2014-01-05 |
2014-01-05 | 1 | 2014-01-04 | 2014-01-05 |
2014-01-06 | 0 | ||
2014-01-07 | 0 | ||
2014-01-08 | 1 | 2014-01-08 | 2014-01-08 |
2014-01-09 | 0 | ||
2014-01-10 | 0 | ||
2014-01-11 | 1 | 2014-01-11 | 2014-01-12 |
2014-01-12 | 1 | 2014-01-11 | 2014-01-12 |
2014-01-13 | 0 | ||
2014-01-14 | 0 | ||
2014-01-15 | 0 | ||
2014-01-16 | 0 | ||
2014-01-17 | 0 | ||
2014-01-18 | 1 | 2014-01-18 | 2014-01-20 |
2014-01-19 | 1 | 2014-01-18 | 2014-01-20 |
2014-01-20 | 1 | 2014-01-18 | 2014-01-20 |
Implementation
Implementation 1
with a as (
select *
from (
select '2014-01-01' as date_ , '1' as is_holaday
union all select '2014-01-02' as date_ , '0' as is_holaday
union all select '2014-01-03' as date_ , '0' as is_holaday
union all select '2014-01-04' as date_ , '1' as is_holaday
union all select '2014-01-05' as date_ , '1' as is_holaday
union all select '2014-01-06' as date_ , '0' as is_holaday
union all select '2014-01-07' as date_ , '0' as is_holaday
union all select '2014-01-08' as date_ , '1' as is_holaday
union all select '2014-01-09' as date_ , '0' as is_holaday
union all select '2014-01-10' as date_ , '0' as is_holaday
union all select '2014-01-11' as date_ , '1' as is_holaday
union all select '2014-01-12' as date_ , '1' as is_holaday
union all select '2014-01-13' as date_ , '0' as is_holaday
union all select '2014-01-14' as date_ , '0' as is_holaday
union all select '2014-01-15' as date_ , '0' as is_holaday
union all select '2014-01-16' as date_ , '0' as is_holaday
union all select '2014-01-17' as date_ , '0' as is_holaday
union all select '2014-01-18' as date_ , '1' as is_holaday
union all select '2014-01-19' as date_ , '1' as is_holaday
union all select '2014-01-20' as date_ , '1' as is_holaday
)
)
select date_
, is_holaday
, group_id
, if(is_holaday = '0', null, min(date_) over (partition by group_id)) as min_date
, if(is_holaday = '0', null, max(date_) over (partition by group_id)) as max_date
from
(
select date_
,is_holaday
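-- non-holidays get -1: the first holiday run can have a difference of 0, so 0 is not a safe placeholder
, if(is_holaday='1',row_number() over (order by date_ asc)-rank() over (partition by is_holaday order by date_),-1) as group_id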
from a
) as x
order by date_
The key to this problem is grouping consecutive holidays together. Once they are grouped, max and min give the start and end date of each run.
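row_number numbers all rows in date order, producing an increasing sequence, while rank, partitioned by is_holaday, numbers the holidays and the non-holidays separately. Take two holidays D1 and D2 that are adjacent days, so D2 - D1 = 1:
Date | row_number value | rank value |
---|---|---|
D1 | n | k |
D2 | n + 1 | k + 1 |
It is easy to see that n - k = (n + 1) - (k + 1), so D1 and D2 get the same difference and land in the same group. A non-holiday between two holidays advances row_number but not the holiday rank, so the difference grows and a new group starts.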
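To make the argument concrete, here is a minimal sketch that prints row_number, rank, and their difference for every holiday in the sample data. It runs the SQL through Python's built-in sqlite3 module instead of the engine used in this post, which is an assumption made only so the example is easy to run (it needs SQLite 3.25 or later for window functions), and it stores is_holaday as an integer rather than a string.

```python
import sqlite3

# The 20 sample days from the table above; '1' marks a holiday.
flags = "10011001001100000111"
rows = [(f"2014-01-{i + 1:02d}", int(f)) for i, f in enumerate(flags)]

conn = sqlite3.connect(":memory:")
conn.execute("create table a (date_ text, is_holaday integer)")
conn.executemany("insert into a values (?, ?)", rows)

# The window functions must see all rows, so the holiday filter
# is applied in the outer query, after rn and rk are computed.
query = """
select date_, rn, rk, rn - rk as group_id
from (
    select date_
         , is_holaday
         , row_number() over (order by date_) as rn
         , rank() over (partition by is_holaday order by date_) as rk
    from a
)
where is_holaday = 1
order by date_
"""
holiday_rows = conn.execute(query).fetchall()
for row in holiday_rows:
    print(row)
```

Holidays on adjacent days share one group_id, and every gap pushes the next run to a larger value.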
Implementation 2
There is also a more cumbersome approach:
with a as (
select *
from (
select '2014-01-01' as date_ , '1' as is_holaday
union all select '2014-01-02' as date_ , '0' as is_holaday
union all select '2014-01-03' as date_ , '0' as is_holaday
union all select '2014-01-04' as date_ , '1' as is_holaday
union all select '2014-01-05' as date_ , '1' as is_holaday
union all select '2014-01-06' as date_ , '0' as is_holaday
union all select '2014-01-07' as date_ , '0' as is_holaday
union all select '2014-01-08' as date_ , '1' as is_holaday
union all select '2014-01-09' as date_ , '0' as is_holaday
union all select '2014-01-10' as date_ , '0' as is_holaday
union all select '2014-01-11' as date_ , '1' as is_holaday
union all select '2014-01-12' as date_ , '1' as is_holaday
union all select '2014-01-13' as date_ , '0' as is_holaday
union all select '2014-01-14' as date_ , '0' as is_holaday
union all select '2014-01-15' as date_ , '0' as is_holaday
union all select '2014-01-16' as date_ , '0' as is_holaday
union all select '2014-01-17' as date_ , '0' as is_holaday
union all select '2014-01-18' as date_ , '1' as is_holaday
union all select '2014-01-19' as date_ , '1' as is_holaday
union all select '2014-01-20' as date_ , '1' as is_holaday
)
) , bb as (
select date_
,is_holiday
,if(is_holiday='1' and (last_holiday is null or last_holiday = '0'),1,0) as start_holiday
,if(is_holiday='1' and (next_holiday is null or next_holiday = '0'),1,0) as end_holiday
from (
select date_
,is_holaday as is_holiday
,lag(is_holaday) over( order by date_) as last_holiday
,lead(is_holaday) over( order by date_) as next_holiday
from a
) as aa
)
select date_
,is_holiday
,start_date
,if(is_holiday = '0','' , end_date) as end_date
from (
select ee.date_
,ee.is_holiday
,ee.start_date
,dd.date_ as end_date
,row_number() over(partition by ee.date_ order by dd.date_) as index_
from (
select date_
,is_holiday
,if(is_holiday = '0','' , start_date) as start_date
from (
select bb.date_
,cc.date_ as start_date
,bb.is_holiday
,row_number() over(partition by bb.date_ order by cc.date_ desc) as index
from bb
cross join (
select * from bb where start_holiday = 1
) as cc
where bb.date_ >= cc.date_
order by bb.date_
)
where index = 1
) as ee cross join (
select * from bb where end_holiday = 1
) as dd
where ee.date_ <= dd.date_
)
where index_ = 1
order by date_
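The start flag computed with lag in the query above can also be turned into a group id with a running sum, which avoids the two cross joins. This is a variation, not the query from this post: a minimal sketch run through Python's sqlite3 module (an assumption for runnability, SQLite 3.25 or later), with run_id as a hypothetical column name and is_holaday stored as an integer.

```python
import sqlite3

# The 20 sample days from the table above; '1' marks a holiday.
flags = "10011001001100000111"
rows = [(f"2014-01-{i + 1:02d}", int(f)) for i, f in enumerate(flags)]

conn = sqlite3.connect(":memory:")
conn.execute("create table a (date_ text, is_holaday integer)")
conn.executemany("insert into a values (?, ?)", rows)

query = """
with bb as (
    -- a holiday starts a run when the previous day is missing or not a holiday
    select date_
         , is_holaday
         , case when is_holaday = 1
                 and coalesce(lag(is_holaday) over (order by date_), 0) = 0
                then 1 else 0 end as start_holiday
    from a
), grouped as (
    -- running count of run starts: every day of one run gets the same run_id
    select date_
         , is_holaday
         , sum(start_holiday) over (order by date_) as run_id
    from bb
)
select min(date_) as start_date, max(date_) as end_date
from grouped
where is_holaday = 1
group by run_id
order by start_date
"""
runs = conn.execute(query).fetchall()
for run in runs:
    print(run)
```

This returns one row per holiday run; joining it back to the daily table on run_id would give the per-day layout of the expected result.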
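Implementation 3
Implementation 3 raises the difficulty. Given the data below, we need not only the start and end date of each run of consecutive days, but also to keep only the runs that last three days or more.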
select *
from (
select 'A' as shop,'2017-10-11' as day,300 as amt
union all select 'A' as shop,'2017-10-12' as day , 200 as amt
union all select 'B' as shop,'2017-10-11' as day , 400 as amt
union all select 'B' as shop,'2017-10-12' as day , 200 as amt
union all select 'A' as shop,'2017-10-13' as day , 100 as amt
union all select 'A' as shop,'2017-10-15' as day , 100 as amt
union all select 'C' as shop,'2017-10-11' as day , 350 as amt
union all select 'C' as shop,'2017-10-15' as day , 400 as amt
union all select 'C' as shop,'2017-10-16' as day , 200 as amt
union all select 'D' as shop,'2017-10-13' as day , 500 as amt
union all select 'E' as shop,'2017-10-14' as day , 600 as amt
union all select 'E' as shop,'2017-10-15' as day , 500 as amt
union all select 'D' as shop,'2017-10-14' as day , 600 as amt
union all select 'B' as shop,'2017-10-13' as day , 300 as amt
union all select 'C' as shop,'2017-10-17' as day , 100 as amt
union all select 'G' as shop,'2017-10-31' as day , 100 as amt
union all select 'G' as shop,'2017-11-01' as day , 100 as amt
union all select 'G' as shop,'2017-11-02' as day , 100 as amt
)
order by shop , day desc
The solution:
select *
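-- partition by shop and plus so each run gets its own start and end, even if a shop has several runs
, first_value(day) over(partition by shop , plus order by day) as first_day
, first_value(day) over(partition by shop , plus order by day desc ) as last_day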
from (
select *
,count(1) over(partition by shop , plus ) as coutinues_plus
from (
select *
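-- This is the key step, and it works much better than the row_number() - rank() approach.
-- It also handles dates with gaps: day offset + descending row number stays constant within a run of consecutive days.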
,date_diff('day' , date('2017-01-01') , date(day))
+ row_number() over(partition by shop order by day desc ) as plus
from (
select *
from (
select 'A' as shop,'2017-10-11' as day,300 as amt
union all select 'A' as shop,'2017-10-12' as day , 200 as amt
union all select 'B' as shop,'2017-10-11' as day , 400 as amt
union all select 'B' as shop,'2017-10-12' as day , 200 as amt
union all select 'A' as shop,'2017-10-13' as day , 100 as amt
union all select 'A' as shop,'2017-10-15' as day , 100 as amt
union all select 'C' as shop,'2017-10-11' as day , 350 as amt
union all select 'C' as shop,'2017-10-15' as day , 400 as amt
union all select 'C' as shop,'2017-10-16' as day , 200 as amt
union all select 'D' as shop,'2017-10-13' as day , 500 as amt
union all select 'E' as shop,'2017-10-14' as day , 600 as amt
union all select 'E' as shop,'2017-10-15' as day , 500 as amt
union all select 'D' as shop,'2017-10-14' as day , 600 as amt
union all select 'B' as shop,'2017-10-13' as day , 300 as amt
union all select 'C' as shop,'2017-10-17' as day , 100 as amt
union all select 'G' as shop,'2017-10-31' as day , 100 as amt
union all select 'G' as shop,'2017-11-01' as day , 100 as amt
union all select 'G' as shop,'2017-11-02' as day , 100 as amt
)
order by shop , day desc
)
)
)
where coutinues_plus >= 3
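The grouping key can be checked on the same shop data with a compact variant that aggregates per run instead of using first_value. This is a sketch under assumptions, not the query above: it runs on Python's sqlite3 module (SQLite 3.25 or later), replaces date_diff with a julianday difference because SQLite has no date_diff, and uses a hypothetical table name sales.

```python
import sqlite3

# (shop, day, amt) rows copied from the sample data above.
sales = [
    ("A", "2017-10-11", 300), ("A", "2017-10-12", 200),
    ("B", "2017-10-11", 400), ("B", "2017-10-12", 200),
    ("A", "2017-10-13", 100), ("A", "2017-10-15", 100),
    ("C", "2017-10-11", 350), ("C", "2017-10-15", 400),
    ("C", "2017-10-16", 200), ("D", "2017-10-13", 500),
    ("E", "2017-10-14", 600), ("E", "2017-10-15", 500),
    ("D", "2017-10-14", 600), ("B", "2017-10-13", 300),
    ("C", "2017-10-17", 100), ("G", "2017-10-31", 100),
    ("G", "2017-11-01", 100), ("G", "2017-11-02", 100),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table sales (shop text, day text, amt integer)")
conn.executemany("insert into sales values (?, ?, ?)", sales)

query = """
select shop, min(day) as first_day, max(day) as last_day, count(*) as days
from (
    select shop
         , day
         -- days since a fixed date plus the descending row number:
         -- constant within a run of consecutive days
         , cast(julianday(day) - julianday('2017-01-01') as integer)
           + row_number() over (partition by shop order by day desc) as plus
    from sales
)
group by shop, plus
having count(*) >= 3
order by shop, first_day
"""
long_runs = conn.execute(query).fetchall()
for run in long_runs:
    print(run)
```

Shop G shows why the key is built from a day offset rather than the day of the month: its run crosses from October into November and is still treated as one run.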
When only start and end timestamps are available
select user_name
,time_type
,lag(time_type , 1 , 0 ) over(partition by user_name order by ts) as pre_time_type
,lead(time_type , 1 , 1 )over(partition by user_name order by ts) as next_time_type
from (
select 1 as user_name , 1 as time_type, 123 as ts
union all select 1 as user_name , 1 as time_type, 126 as ts
union all select 1 as user_name , 0 as time_type, 166 as ts
union all select 1 as user_name , 0 as time_type, 167 as ts
) as a
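Here user_name is the employee ID, time_type is the start/end flag, and ts is the timestamp.
lag(field , interval , default_expression) returns the value of field from the row that is interval rows before the current one, and lead(field , interval , default_value) does the opposite. How do you tell the two directions apart? Reading the rows from top to bottom in window order, looking down is lead (ahead) and looking up is lag (behind).
To pick the first row of a run of consecutive start records, use lag and require that the previous record carries the end flag. Conversely, to pick the first row of each run of end records, require that the current time_type is an end and the previous one is a start.
To find the last row of a run instead, use lead.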
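The rules above can be tried on the four sample rows. The sketch below assumes that time_type 1 means start and 0 means end, which is inferred from the lag and lead defaults in the query and is not stated in the post. It runs on Python's sqlite3 module (SQLite 3.25 or later), and events is a hypothetical table name.

```python
import sqlite3

# (user_name, time_type, ts) rows from the sample query above.
events = [(1, 1, 123), (1, 1, 126), (1, 0, 166), (1, 0, 167)]

conn = sqlite3.connect(":memory:")
conn.execute("create table events (user_name integer, time_type integer, ts integer)")
conn.executemany("insert into events values (?, ?, ?)", events)

query = """
select ts
     , time_type
     , lag(time_type, 1, 0) over (partition by user_name order by ts) as pre_time_type
     , lead(time_type, 1, 1) over (partition by user_name order by ts) as next_time_type
from events
order by ts
"""
marked = conn.execute(query).fetchall()

# Assumed encoding: 1 = start, 0 = end.
first_start = [ts for ts, t, pre, nxt in marked if t == 1 and pre == 0]
first_end = [ts for ts, t, pre, nxt in marked if t == 0 and pre == 1]
last_start = [ts for ts, t, pre, nxt in marked if t == 1 and nxt == 0]
last_end = [ts for ts, t, pre, nxt in marked if t == 0 and nxt == 1]
print(first_start, first_end, last_start, last_end)
```

The defaults matter at the edges: a lag default of 0 lets the very first record count as the first start, and a lead default of 1 lets the very last record count as the last end.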