Continuity Problems in Data Analysis

bluedraam_pp

Category: Offline Data Warehouse. Tags: sql

Article link: https://blog.csdn.net/bluedraam_pp/article/details/93538284


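Scenarios

For an app, we naturally want users to be in it all the time. The more often a user opens the app, the more active that user is. From the operations side, highly active users need little retention effort; it is the moderately active users, and the ones who only log in occasionally, that take real effort to keep. Activity level is therefore a very important metric when planning campaign strategy.

For a courier, how do we judge how hard they work? One measure is the number of consecutive days on duty within a week or a month.

For a supermarket, the worry is that a customer comes in to buy something and the shelf is empty. A large supermarket has a great many items to restock, so we can use the maximum number of consecutive out-of-stock days to rank restocking urgency. Being out of stock does not necessarily mean there is demand, but here we assume that every item is bought every day.

These three scenarios reduce to a single problem: computing how many times in a row an event occurs within a given period.

To simplify the problem, find the start and end date of each holiday run in the data below:

Date	Is holiday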
2014-01-01	1
2014-01-02	0
2014-01-03	0
2014-01-04	1
2014-01-05	1
2014-01-06	0
2014-01-07	0
2014-01-08	1
2014-01-09	0
2014-01-10	0
2014-01-11	1
2014-01-12	1
2014-01-13	0
2014-01-14	0
2014-01-15	0
2014-01-16	0
2014-01-17	0
2014-01-18	1
2014-01-19	1
2014-01-20	1

The result is as follows:

Date	Is holiday	Start	End
2014-01-01	1	2014-01-01	2014-01-01
2014-01-02	0
2014-01-03	0
2014-01-04	1	2014-01-04	2014-01-05
2014-01-05	1	2014-01-04	2014-01-05
2014-01-06	0
2014-01-07	0
2014-01-08	1	2014-01-08	2014-01-08
2014-01-09	0
2014-01-10	0
2014-01-11	1	2014-01-11	2014-01-12
2014-01-12	1	2014-01-11	2014-01-12
2014-01-13	0
2014-01-14	0
2014-01-15	0
2014-01-16	0
2014-01-17	0
2014-01-18	1	2014-01-18	2014-01-20
2014-01-19	1	2014-01-18	2014-01-20
2014-01-20	1	2014-01-18	2014-01-20
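
As a cross-check on the expected result above, here is a minimal Python sketch (it is not part of the original SQL). It groups consecutive rows that share the same flag with itertools.groupby and attaches the run's start and end date to every holiday row. The function name holiday_runs and the tuple layout are assumptions made for illustration, and the sketch assumes one row per calendar day, as in the table.

```python
from itertools import groupby

def holiday_runs(rows):
    """rows: list of (date_str, flag) sorted by date; flag is '1' for a holiday.

    Returns (date, flag, start, end) tuples; start/end are None on non-holidays.
    """
    out = []
    # groupby splits the sorted rows into maximal runs of equal flag values
    for flag, run in groupby(rows, key=lambda r: r[1]):
        run = list(run)
        start, end = run[0][0], run[-1][0]
        for date, _ in run:
            if flag == '1':
                out.append((date, flag, start, end))
            else:
                out.append((date, flag, None, None))
    return out

rows = [("2014-01-01", "1"), ("2014-01-02", "0"), ("2014-01-03", "0"),
        ("2014-01-04", "1"), ("2014-01-05", "1"), ("2014-01-06", "0")]
result = holiday_runs(rows)
```

A single-day holiday such as 2014-01-01 gets the same date as both start and end.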

Implementation

Implementation 1

with a as (
  select *
  from (
    select '2014-01-01' as date_ , '1' as is_holaday
    union all select '2014-01-02' as date_ , '0' as is_holaday
    union all select '2014-01-03' as date_ , '0' as is_holaday
    union all select '2014-01-04' as date_ , '1' as is_holaday
    union all select '2014-01-05' as date_ , '1' as is_holaday
    union all select '2014-01-06' as date_ , '0' as is_holaday
    union all select '2014-01-07' as date_ , '0' as is_holaday    
    union all select '2014-01-08' as date_ , '1' as is_holaday
    union all select '2014-01-09' as date_ , '0' as is_holaday
    union all select '2014-01-10' as date_ , '0' as is_holaday
    union all select '2014-01-11' as date_ , '1' as is_holaday
    union all select '2014-01-12' as date_ , '1' as is_holaday
    union all select '2014-01-13' as date_ , '0' as is_holaday    
    union all select '2014-01-14' as date_ , '0' as is_holaday
    union all select '2014-01-15' as date_ , '0' as is_holaday
    union all select '2014-01-16' as date_ , '0' as is_holaday
    union all select '2014-01-17' as date_ , '0' as is_holaday
    union all select '2014-01-18' as date_ , '1' as is_holaday
    union all select '2014-01-19' as date_ , '1' as is_holaday
    union all select '2014-01-20' as date_ , '1' as is_holaday            
  )
) 
select date_
, is_holaday
, group_id 
, if(is_holaday = '0', null, min(date_) over (partition by group_id)) as min_date
, if(is_holaday = '0', null, max(date_) over (partition by group_id)) as max_date
from 
( 
select date_
      ,is_holaday
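  -- non-holidays get null rather than 0, so they cannot share a group with a holiday run that starts on the first row
  , if(is_holaday='1',row_number() over (order by date_ asc)-rank() over (partition by is_holaday order by date_),null) as group_id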
  from a 
) as x 
order by date_

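The key to this problem is grouping consecutive holidays. Once each run has its own group, max and min give the start and end date of that run.

row_number numbers all rows in date order, while rank numbers the rows in date order within each is_holaday partition. Take two holidays D1 and D2 with D2 - D1 = 1, that is, adjacent days. Because the table has one row per day, both counters advance by exactly 1 from D1 to D2:

Date	row_number value	rank value
D1	n + 1	k + 1
D2	n + 2	k + 2

Clearly (n + 1) - (k + 1) = n - k and (n + 2) - (k + 2) = n - k, so D1 and D2 land in the same group. When non-holidays sit between two holidays, row_number advances but the holiday rank does not, so the difference changes and a new group starts.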
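
The row_number minus rank argument can be checked with a small Python sketch. It is an illustration under stated assumptions, not the author's code: group_ids is a hypothetical helper, the input has one row per day, and the dates are ISO strings so that string order equals date order.

```python
def group_ids(rows):
    """rows: list of (date_str, flag), one row per day.

    Returns {date: group_id} for holiday rows only.
    """
    ordered = sorted(rows)          # ISO date strings sort chronologically
    ids = {}
    holiday_rank = 0
    for row_number, (date, flag) in enumerate(ordered, start=1):
        if flag == '1':
            holiday_rank += 1       # rank within the holiday partition
            ids[date] = row_number - holiday_rank
    return ids

rows = [("2014-01-01", "1"), ("2014-01-02", "0"), ("2014-01-03", "0"),
        ("2014-01-04", "1"), ("2014-01-05", "1"), ("2014-01-06", "0"),
        ("2014-01-07", "0"), ("2014-01-08", "1")]
ids = group_ids(rows)
```

Here 2014-01-04 and 2014-01-05 receive the same id, while 2014-01-01 and 2014-01-08 each receive a different one.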

Implementation 2

There is also a more cumbersome approach:

with a as (
select *
  from (
    select '2014-01-01' as date_ , '1' as is_holaday
    union all select '2014-01-02' as date_ , '0' as is_holaday
    union all select '2014-01-03' as date_ , '0' as is_holaday
    union all select '2014-01-04' as date_ , '1' as is_holaday
    union all select '2014-01-05' as date_ , '1' as is_holaday
    union all select '2014-01-06' as date_ , '0' as is_holaday
    union all select '2014-01-07' as date_ , '0' as is_holaday    
    union all select '2014-01-08' as date_ , '1' as is_holaday
    union all select '2014-01-09' as date_ , '0' as is_holaday
    union all select '2014-01-10' as date_ , '0' as is_holaday
    union all select '2014-01-11' as date_ , '1' as is_holaday
    union all select '2014-01-12' as date_ , '1' as is_holaday
    union all select '2014-01-13' as date_ , '0' as is_holaday    
    union all select '2014-01-14' as date_ , '0' as is_holaday
    union all select '2014-01-15' as date_ , '0' as is_holaday
    union all select '2014-01-16' as date_ , '0' as is_holaday
    union all select '2014-01-17' as date_ , '0' as is_holaday
    union all select '2014-01-18' as date_ , '1' as is_holaday
    union all select '2014-01-19' as date_ , '1' as is_holaday
    union all select '2014-01-20' as date_ , '1' as is_holaday            
  )
) , bb as (
select date_
      ,is_holiday
      ,if(is_holiday='1' and (last_holiday is null or last_holiday = '0'),1,0) as start_holiday
      ,if(is_holiday='1' and (next_holiday is null or next_holiday = '0'),1,0) as end_holiday
 from (
select date_
      ,is_holaday as is_holiday
      ,lag(is_holaday) over( order by date_) as last_holiday
      ,lead(is_holaday) over( order by date_) as next_holiday      
 from a 
) as aa 
)
select date_
      ,is_holiday
      ,start_date
      ,if(is_holiday = '0','' , end_date) as end_date
  from (
select ee.date_
      ,ee.is_holiday
      ,ee.start_date
      ,dd.date_ as end_date
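      -- order by the candidate end date so that index_ = 1 is the nearest end on or after this day
      ,row_number() over(partition by ee.date_ order by dd.date_) as index_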
  from (
select date_
      ,is_holiday
      ,if(is_holiday = '0','' , start_date) as start_date
 from (
select bb.date_
      ,cc.date_ as start_date
      ,bb.is_holiday
      ,row_number() over(partition by bb.date_ order by cc.date_ desc) as index
  from bb 
  cross join (
    select * from bb where start_holiday = 1
  ) as cc
where bb.date_ >= cc.date_
order by bb.date_
)
where index = 1
) as ee cross join (
select * from bb where end_holiday = 1
) as dd 
where ee.date_ <= dd.date_ 
)
where index_ = 1 
order by date_
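
The heart of this second approach is the bb step: a holiday whose previous row is missing or a non-holiday starts a run, and a holiday whose next row is missing or a non-holiday ends one. Below is a minimal Python sketch of just that flagging step. The name mark_boundaries is hypothetical, and the rows are assumed to be sorted by date already.

```python
def mark_boundaries(rows):
    """rows: list of (date_str, flag) sorted by date.

    Returns (date, flag, start_holiday, end_holiday) tuples, mirroring the bb CTE.
    """
    out = []
    for i, (date, flag) in enumerate(rows):
        prev_flag = rows[i - 1][1] if i > 0 else None              # lag(flag)
        next_flag = rows[i + 1][1] if i + 1 < len(rows) else None  # lead(flag)
        is_start = int(flag == '1' and prev_flag in (None, '0'))
        is_end = int(flag == '1' and next_flag in (None, '0'))
        out.append((date, flag, is_start, is_end))
    return out

rows = [("2014-01-17", "0"), ("2014-01-18", "1"),
        ("2014-01-19", "1"), ("2014-01-20", "1")]
flags = mark_boundaries(rows)
```

The rest of the SQL then pairs each day with the latest start on or before it and the earliest end on or after it.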

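Implementation 3

Implementation 3 raises the difficulty. Given the data below, we must not only find the start and end date of each run of consecutive days, but also require each run to last at least three days.

    select *
      from (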
                      select 'A' as shop,'2017-10-11' as day,300 as amt
            union all select 'A' as shop,'2017-10-12' as day , 200 as amt
            union all select 'B' as shop,'2017-10-11' as day , 400 as amt
            union all select 'B' as shop,'2017-10-12' as day , 200 as amt
            union all select 'A' as shop,'2017-10-13' as day , 100 as amt
            union all select 'A' as shop,'2017-10-15' as day , 100 as amt
            union all select 'C' as shop,'2017-10-11' as day , 350 as amt
            union all select 'C' as shop,'2017-10-15' as day , 400 as amt
            union all select 'C' as shop,'2017-10-16' as day , 200 as amt
            union all select 'D' as shop,'2017-10-13' as day , 500 as amt
            union all select 'E' as shop,'2017-10-14' as day , 600 as amt
            union all select 'E' as shop,'2017-10-15' as day , 500 as amt
            union all select 'D' as shop,'2017-10-14' as day , 600 as amt
            union all select 'B' as shop,'2017-10-13' as day , 300 as amt
            union all select 'C' as shop,'2017-10-17' as day , 100 as amt 
            
            
            union all select 'G' as shop,'2017-10-31' as day , 100 as amt 
            union all select 'G' as shop,'2017-11-01' as day , 100 as amt 
            union all select 'G' as shop,'2017-11-02' as day , 100 as amt             
    )
    order by shop , day desc

The solution is as follows:


select *
       , first_value(day) over(partition by shop, plus order by day) as first_day
       , first_value(day) over(partition by shop, plus order by day desc) as last_day
  from (
        select * 
           ,count(1) over(partition by shop , plus ) as coutinues_plus
        from (
        select *
              -- The key trick, and an improvement over row_number() - rank():
              -- it still works when the dates themselves have gaps (missing days).
                ,date_diff('day' , date('2017-01-01') , date(day))
                + row_number() over(partition by shop order by day desc ) as plus
        from (
                select * 
                  from (
                            select 'A' as shop,'2017-10-11' as day,300 as amt
                            union all select 'A' as shop,'2017-10-12' as day , 200 as amt
                            union all select 'B' as shop,'2017-10-11' as day , 400 as amt
                            union all select 'B' as shop,'2017-10-12' as day , 200 as amt
                            union all select 'A' as shop,'2017-10-13' as day , 100 as amt
                            union all select 'A' as shop,'2017-10-15' as day , 100 as amt
                            union all select 'C' as shop,'2017-10-11' as day , 350 as amt
                            union all select 'C' as shop,'2017-10-15' as day , 400 as amt
                            union all select 'C' as shop,'2017-10-16' as day , 200 as amt
                            union all select 'D' as shop,'2017-10-13' as day , 500 as amt
                            union all select 'E' as shop,'2017-10-14' as day , 600 as amt
                            union all select 'E' as shop,'2017-10-15' as day , 500 as amt
                            union all select 'D' as shop,'2017-10-14' as day , 600 as amt
                            union all select 'B' as shop,'2017-10-13' as day , 300 as amt
                            union all select 'C' as shop,'2017-10-17' as day , 100 as amt
                            
                            
                            union all select 'G' as shop,'2017-10-31' as day , 100 as amt
                            union all select 'G' as shop,'2017-11-01' as day , 100 as amt
                            union all select 'G' as shop,'2017-11-02' as day , 100 as amt
                )
        order by shop , day desc
        )
        )
)
where coutinues_plus >= 3
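
To see why the day offset plus a descending row number stays constant inside a run, here is a minimal Python sketch of the same idea. It is an illustration rather than the author's code: long_runs and min_days are hypothetical names, date.toordinal() stands in for date_diff from a fixed base date, and each shop is assumed to have at most one row per day.

```python
from collections import defaultdict
from datetime import date

def long_runs(rows, min_days=3):
    """rows: iterable of (shop, 'YYYY-MM-DD').

    Returns {shop: [(first_day, last_day), ...]} for runs of at least min_days.
    """
    by_shop = defaultdict(list)
    for shop, day in rows:
        by_shop[shop].append(date.fromisoformat(day))
    result = {}
    for shop, days in by_shop.items():
        days.sort(reverse=True)                  # order by day desc
        groups = defaultdict(list)
        for row_number, d in enumerate(days, start=1):
            # the day goes down by 1 while the row number goes up by 1,
            # so the sum is constant within a run of consecutive days
            groups[d.toordinal() + row_number].append(d)
        runs = [(min(g).isoformat(), max(g).isoformat())
                for g in groups.values() if len(g) >= min_days]
        if runs:
            result[shop] = sorted(runs)
    return result

rows = [("A", "2017-10-11"), ("A", "2017-10-12"), ("A", "2017-10-13"),
        ("A", "2017-10-15"), ("C", "2017-10-11"), ("C", "2017-10-15"),
        ("C", "2017-10-16"), ("C", "2017-10-17"), ("E", "2017-10-14"),
        ("E", "2017-10-15"), ("G", "2017-10-31"), ("G", "2017-11-01"),
        ("G", "2017-11-02")]
runs = long_runs(rows)
```

Shop E has only two consecutive days and is dropped, while shop G's run is kept even though it crosses a month boundary.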

When only start and end times are available

select user_name 
       ,time_type
       ,lag(time_type , 1 , 0 ) over(partition by user_name order by ts) as pre_time_type
       ,lead(time_type , 1 , 1 )over(partition by user_name order by ts) as next_time_type
  from (
             select 1 as user_name , 1 as time_type, 123 as ts 
   union all select 1 as user_name , 1 as time_type, 126 as ts
   union all select 1 as user_name , 0 as time_type, 166 as ts
   union all select 1 as user_name , 0 as time_type, 167 as ts
) as a

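Here, user_name is the employee ID, time_type flags whether the record is a start or an end, and ts is the timestamp.
lag(field, interval, default_expression) returns the value of field from the row that sits interval rows above the current row; lead(field, interval, default_value) does the opposite and looks interval rows below. How do you tell the two directions apart? As the figure below illustrates, reading the ordered rows from top to bottom, downward is lead (running ahead) and upward is lag (falling behind).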

[Figure: telling lag and lead apart]
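
The direction of lag and lead can also be shown with a small Python sketch applied to the four sample rows of the query above. The two helper functions are illustrative stand-ins for the SQL window functions, not a library API, and they assume the list is already in ts order with a non-negative offset.

```python
def lag(values, offset=1, default=None):
    """Value `offset` rows above each row (earlier in the ordering), else `default`."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset=1, default=None):
    """Value `offset` rows below each row (later in the ordering), else `default`."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

time_type = [1, 1, 0, 0]                  # rows ordered by ts: 123, 126, 166, 167
pre_time_type = lag(time_type, 1, 0)      # like lag(time_type, 1, 0)
next_time_type = lead(time_type, 1, 1)    # like lead(time_type, 1, 1)
```

The first row has nothing above it, so lag falls back to its default; the last row has nothing below it, so lead falls back to its default.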