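Consecutive-run problems in data analysis

Scenarios

For an app, we naturally want users to use it all the time. The more often a user opens the app, the more active that user is. Operations staff do not need to spend much effort retaining highly active users; it is the moderately active users, and those who only log in occasionally, who take real effort to keep. Activity is therefore a very important metric when planning campaigns.

For a courier, how do we judge how hard he works? One measure is the number of consecutive days worked within a week or a month.

For a supermarket, the worry is that a customer comes to buy something and the shelf is empty. A large supermarket has many items to restock, so we can use the maximum number of consecutive out-of-stock days to rank how urgent each restock is. Being out of stock does not necessarily mean there was demand, but here we assume every item sells every day.

These three scenarios reduce to one abstract problem: counting how many times in a row an event occurs within a given period.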
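As a minimal sketch of this abstraction, the Python snippet below finds the longest run of consecutive days on which an event happened. The function name `longest_run` and the sample week of attendance are hypothetical, made up for illustration only.

```python
from itertools import groupby


def longest_run(flags):
    """Return the length of the longest run of truthy values in flags."""
    run_lengths = (
        sum(1 for _ in group)                  # size of one run
        for happened, group in groupby(flags, key=bool)
        if happened                            # keep only runs where the event occurred
    )
    return max(run_lengths, default=0)


# 1 = the event happened that day, 0 = it did not (hypothetical data)
attendance = [1, 1, 0, 1, 1, 1, 0]
print(longest_run(attendance))
```

The same function would serve all three scenarios: active days, days worked, or out-of-stock days, as long as the input has one flag per day in date order.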

To simplify the problem, find the start and end date of each holiday run in the data below:

Date        Is holiday
2014-01-01  1
2014-01-02  0
2014-01-03  0
2014-01-04  1
2014-01-05  1
2014-01-06  0
2014-01-07  0
2014-01-08  1
2014-01-09  0
2014-01-10  0
2014-01-11  1
2014-01-12  1
2014-01-13  0
2014-01-14  0
2014-01-15  0
2014-01-16  0
2014-01-17  0
2014-01-18  1
2014-01-19  1
2014-01-20  1

The expected result:

Date        Is holiday  Start       End
2014-01-01  1           2014-01-01  2014-01-01
2014-01-02  0
2014-01-03  0
2014-01-04  1           2014-01-04  2014-01-05
2014-01-05  1           2014-01-04  2014-01-05
2014-01-06  0
2014-01-07  0
2014-01-08  1           2014-01-08  2014-01-08
2014-01-09  0
2014-01-10  0
2014-01-11  1           2014-01-11  2014-01-12
2014-01-12  1           2014-01-11  2014-01-12
2014-01-13  0
2014-01-14  0
2014-01-15  0
2014-01-16  0
2014-01-17  0
2014-01-18  1           2014-01-18  2014-01-20
2014-01-19  1           2014-01-18  2014-01-20
2014-01-20  1           2014-01-18  2014-01-20

Implementation

Implementation 1

with a as (
  select *
  from (
    select '2014-01-01' as date_ , '1' as is_holaday
    union all select '2014-01-02' as date_ , '0' as is_holaday
    union all select '2014-01-03' as date_ , '0' as is_holaday
    union all select '2014-01-04' as date_ , '1' as is_holaday
    union all select '2014-01-05' as date_ , '1' as is_holaday
    union all select '2014-01-06' as date_ , '0' as is_holaday
    union all select '2014-01-07' as date_ , '0' as is_holaday    
    union all select '2014-01-08' as date_ , '1' as is_holaday
    union all select '2014-01-09' as date_ , '0' as is_holaday
    union all select '2014-01-10' as date_ , '0' as is_holaday
    union all select '2014-01-11' as date_ , '1' as is_holaday
    union all select '2014-01-12' as date_ , '1' as is_holaday
    union all select '2014-01-13' as date_ , '0' as is_holaday    
    union all select '2014-01-14' as date_ , '0' as is_holaday
    union all select '2014-01-15' as date_ , '0' as is_holaday
    union all select '2014-01-16' as date_ , '0' as is_holaday
    union all select '2014-01-17' as date_ , '0' as is_holaday
    union all select '2014-01-18' as date_ , '1' as is_holaday
    union all select '2014-01-19' as date_ , '1' as is_holaday
    union all select '2014-01-20' as date_ , '1' as is_holaday            
  )
) 
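select date_
     , is_holaday
     , group_id
     -- partition by is_holaday as well: a holiday run whose group_id happens to be 0
     -- must not be merged with the non-holiday rows, which all get group_id 0
     , if(is_holaday = '0', null, min(date_) over (partition by is_holaday, group_id)) as min_date
     , if(is_holaday = '0', null, max(date_) over (partition by is_holaday, group_id)) as max_date
from (
    select date_
         , is_holaday
         -- consecutive holidays share the same (row_number - rank) difference
         , if(is_holaday = '1'
              , row_number() over (order by date_ asc) - rank() over (partition by is_holaday order by date_)
              , 0) as group_id
    from a
) as x
order by date_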

The key to this problem is grouping consecutive holidays. Once they are grouped, max and min give the start and end date of each run.

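row_number numbers all rows in date order, while rank numbers the rows within each is_holaday value, again in date order. Take two dates D1 and D2 with D2 - D1 = 1 (one day apart), both of them holidays:

Date  row_number  rank
D1    n           k
D2    n + 1       k + 1

Since (n + 1) - (k + 1) = n - k, D1 and D2 get the same difference and land in the same group. A non-holiday between two runs advances row_number but not the holiday rank, so the next run gets a larger difference and therefore a different group.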
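To check this reasoning outside a SQL engine, here is a small Python sketch that mimics the row_number minus rank computation. The helper name `group_ids` is hypothetical, and the flags are the first eight days of the sample data. Non-holidays get `None` here so that they cannot collide with a holiday run whose difference is 0.

```python
def group_ids(flags):
    """Mimic row_number() over all rows minus rank() over holiday rows only.

    flags must be in date order, one entry per day ('1' = holiday).
    """
    ids = []
    holiday_rank = 0
    for row_number, flag in enumerate(flags, start=1):
        if flag == "1":
            holiday_rank += 1                      # rank within the holiday partition
            ids.append(row_number - holiday_rank)  # constant inside one run
        else:
            ids.append(None)                       # non-holidays belong to no run
    return ids


# 2014-01-01 .. 2014-01-08 from the sample data
flags = ["1", "0", "0", "1", "1", "0", "0", "1"]
print(group_ids(flags))
```

The two adjacent holidays (2014-01-04 and 2014-01-05) receive the same id, while the isolated holidays before and after them receive different ones.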

Implementation 2

There is also a more cumbersome approach:

with a as (
select *
  from (
    select '2014-01-01' as date_ , '1' as is_holaday
    union all select '2014-01-02' as date_ , '0' as is_holaday
    union all select '2014-01-03' as date_ , '0' as is_holaday
    union all select '2014-01-04' as date_ , '1' as is_holaday
    union all select '2014-01-05' as date_ , '1' as is_holaday
    union all select '2014-01-06' as date_ , '0' as is_holaday
    union all select '2014-01-07' as date_ , '0' as is_holaday    
    union all select '2014-01-08' as date_ , '1' as is_holaday
    union all select '2014-01-09' as date_ , '0' as is_holaday
    union all select '2014-01-10' as date_ , '0' as is_holaday
    union all select '2014-01-11' as date_ , '1' as is_holaday
    union all select '2014-01-12' as date_ , '1' as is_holaday
    union all select '2014-01-13' as date_ , '0' as is_holaday    
    union all select '2014-01-14' as date_ , '0' as is_holaday
    union all select '2014-01-15' as date_ , '0' as is_holaday
    union all select '2014-01-16' as date_ , '0' as is_holaday
    union all select '2014-01-17' as date_ , '0' as is_holaday
    union all select '2014-01-18' as date_ , '1' as is_holaday
    union all select '2014-01-19' as date_ , '1' as is_holaday
    union all select '2014-01-20' as date_ , '1' as is_holaday            
  )
) , bb as (
select date_
      ,is_holiday
      ,if(is_holiday='1' and (last_holiday is null or last_holiday = '0'),1,0) as start_holiday
      ,if(is_holiday='1' and (next_holiday is null or next_holiday = '0'),1,0) as end_holiday
 from (
select date_
      ,is_holaday as is_holiday
      ,lag(is_holaday) over( order by date_) as last_holiday
      ,lead(is_holaday) over( order by date_) as next_holiday      
 from a 
) as aa 
)
select date_
     , is_holiday
     , start_date
     , if(is_holiday = '0', '', end_date) as end_date
from (
    select ee.date_
         , ee.is_holiday
         , ee.start_date
         , dd.date_ as end_date
         -- order by dd.date_ so that index_ = 1 is the nearest end date on or after the current date
         , row_number() over (partition by ee.date_ order by dd.date_ asc) as index_
    from (
        select date_
             , is_holiday
             , if(is_holiday = '0', '', start_date) as start_date
        from (
            select bb.date_
                 , cc.date_ as start_date
                 , bb.is_holiday
                 -- index = 1 is the nearest start date on or before the current date
                 , row_number() over (partition by bb.date_ order by cc.date_ desc) as index
            from bb
            cross join (
                select * from bb where start_holiday = 1
            ) as cc
            where bb.date_ >= cc.date_
        )
        where index = 1
    ) as ee
    cross join (
        select * from bb where end_holiday = 1
    ) as dd
    where ee.date_ <= dd.date_
)
where index_ = 1
order by date_
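The idea behind this second approach is to flag the first and last day of each run by looking at the neighbouring rows, then pair every start with the next end. The Python sketch below shows only that idea in a single pass; it is not a translation of the query, and the function name `holiday_ranges` is hypothetical.

```python
def holiday_ranges(days):
    """Return (start, end) for each holiday run.

    days: list of (date, is_holiday) tuples sorted by date, is_holiday is 0 or 1.
    """
    ranges = []
    start = None
    for i, (day, is_holiday) in enumerate(days):
        if not is_holiday:
            continue
        # lag / lead with a missing neighbour treated as a non-holiday
        prev_flag = days[i - 1][1] if i > 0 else 0
        next_flag = days[i + 1][1] if i + 1 < len(days) else 0
        if not prev_flag:          # same test as start_holiday = 1
            start = day
        if not next_flag:          # same test as end_holiday = 1
            ranges.append((start, day))
    return ranges


sample = [
    ("2014-01-01", 1), ("2014-01-02", 0), ("2014-01-03", 0), ("2014-01-04", 1),
    ("2014-01-05", 1), ("2014-01-06", 0), ("2014-01-07", 0), ("2014-01-08", 1),
]
print(holiday_ranges(sample))
```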

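Implementation 3

Implementation 3 raises the difficulty. Given the data below, we need not only the start and end date of each consecutive run, but also require that the run lasts at least three days.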

    select  *
      from (
          select * from (
                      select 'A' as shop,'2017-10-11' as day,300 as amt
            union all select 'A' as shop,'2017-10-12' as day , 200 as amt
            union all select 'B' as shop,'2017-10-11' as day , 400 as amt
            union all select 'B' as shop,'2017-10-12' as day , 200 as amt
            union all select 'A' as shop,'2017-10-13' as day , 100 as amt
            union all select 'A' as shop,'2017-10-15' as day , 100 as amt
            union all select 'C' as shop,'2017-10-11' as day , 350 as amt
            union all select 'C' as shop,'2017-10-15' as day , 400 as amt
            union all select 'C' as shop,'2017-10-16' as day , 200 as amt
            union all select 'D' as shop,'2017-10-13' as day , 500 as amt
            union all select 'E' as shop,'2017-10-14' as day , 600 as amt
            union all select 'E' as shop,'2017-10-15' as day , 500 as amt
            union all select 'D' as shop,'2017-10-14' as day , 600 as amt
            union all select 'B' as shop,'2017-10-13' as day , 300 as amt
            union all select 'C' as shop,'2017-10-17' as day , 100 as amt 
            
            
            union all select 'G' as shop,'2017-10-31' as day , 100 as amt 
            union all select 'G' as shop,'2017-11-01' as day , 100 as amt 
            union all select 'G' as shop,'2017-11-02' as day , 100 as amt             
          )
    )
    order by shop , day desc

The solution:


select *
       -- partition by plus as well, so that each run gets its own first and last day
       , first_value(day) over(partition by shop, plus order by day) as first_day
       , first_value(day) over(partition by shop, plus order by day desc) as last_day
  from (
        select * 
           ,count(1) over(partition by shop , plus ) as coutinues_plus
        from (
        select *
              -- This is the key trick, and it is more robust than row_number() - rank():
              -- it still works when the dates have gaps, because consecutive days get the same sum.
                ,date_diff('day' , date('2017-01-01') , date(day))
                + row_number() over(partition by shop order by day desc ) as plus
        from (
                select * 
                  from (
                            select 'A' as shop,'2017-10-11' as day,300 as amt
                            union all select 'A' as shop,'2017-10-12' as day , 200 as amt
                            union all select 'B' as shop,'2017-10-11' as day , 400 as amt
                            union all select 'B' as shop,'2017-10-12' as day , 200 as amt
                            union all select 'A' as shop,'2017-10-13' as day , 100 as amt
                            union all select 'A' as shop,'2017-10-15' as day , 100 as amt
                            union all select 'C' as shop,'2017-10-11' as day , 350 as amt
                            union all select 'C' as shop,'2017-10-15' as day , 400 as amt
                            union all select 'C' as shop,'2017-10-16' as day , 200 as amt
                            union all select 'D' as shop,'2017-10-13' as day , 500 as amt
                            union all select 'E' as shop,'2017-10-14' as day , 600 as amt
                            union all select 'E' as shop,'2017-10-15' as day , 500 as amt
                            union all select 'D' as shop,'2017-10-14' as day , 600 as amt
                            union all select 'B' as shop,'2017-10-13' as day , 300 as amt
                            union all select 'C' as shop,'2017-10-17' as day , 100 as amt
                            
                            
                            union all select 'G' as shop,'2017-10-31' as day , 100 as amt
                            union all select 'G' as shop,'2017-11-01' as day , 100 as amt
                            union all select 'G' as shop,'2017-11-02' as day , 100 as amt
                )
        order by shop , day desc
        )
        )
)
where coutinues_plus >= 3 
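The trick in the query is that, within one shop, the day number plus a descending row number stays constant across consecutive days and changes at every gap. The Python sketch below reproduces it on the same sample rows so the grouping can be inspected step by step. The function name `long_runs` is hypothetical, and the sketch assumes at most one row per shop per day.

```python
from collections import defaultdict
from datetime import date


def long_runs(sales, min_days=3):
    """Return sorted (shop, first_day, last_day, length) for runs of at least min_days.

    sales: iterable of (shop, 'YYYY-MM-DD'), at most one row per shop per day (assumed).
    """
    by_shop = defaultdict(list)
    for shop, day in sales:
        by_shop[shop].append(date.fromisoformat(day))

    result = []
    for shop, days in by_shop.items():
        runs = defaultdict(list)
        # row_number() over (partition by shop order by day desc)
        for row_number, day in enumerate(sorted(days, reverse=True), start=1):
            plus = day.toordinal() + row_number   # constant within a consecutive run
            runs[plus].append(day)
        for run in runs.values():
            if len(run) >= min_days:
                result.append((shop, min(run).isoformat(), max(run).isoformat(), len(run)))
    return sorted(result)


sales = [
    ("A", "2017-10-11"), ("A", "2017-10-12"), ("B", "2017-10-11"), ("B", "2017-10-12"),
    ("A", "2017-10-13"), ("A", "2017-10-15"), ("C", "2017-10-11"), ("C", "2017-10-15"),
    ("C", "2017-10-16"), ("D", "2017-10-13"), ("E", "2017-10-14"), ("E", "2017-10-15"),
    ("D", "2017-10-14"), ("B", "2017-10-13"), ("C", "2017-10-17"),
    ("G", "2017-10-31"), ("G", "2017-11-01"), ("G", "2017-11-02"),
]
for row in long_runs(sales):
    print(row)
```

Shops D and E only have two consecutive days each and are filtered out, while shop G shows that a run crossing a month boundary is still detected.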

When only start and end events are recorded

select user_name 
       ,time_type
       ,lag(time_type , 1 , 0 ) over(partition by user_name order by ts) as pre_time_type
       ,lead(time_type , 1 , 1 )over(partition by user_name order by ts) as next_time_type
  from (
             select 1 as user_name , 1 as time_type, 123 as ts 
   union all select 1 as user_name , 1 as time_type, 126 as ts
   union all select 1 as user_name , 0 as time_type, 166 as ts
   union all select 1 as user_name , 0 as time_type, 167 as ts
) as a 

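Here user_name is the employee ID, time_type marks whether a record is a start or an end event, and ts is the timestamp.
lag(field, interval, default_expression) returns the value of field from the record interval rows above the current one, and lead(field, interval, default_value) does the opposite. How do you tell the two directions apart? Reading the rows from top to bottom, lead looks downward (ahead of the current row) and lag looks upward (behind it).

Figure: telling lag and lead apart

To get the first record of a run of consecutive start events, use lag and check that the previous record is an end marker. Conversely, to get the first record of a run of end events, require that the current time_type is an end and the previous one is a start.

To find the last record of a run, use lead in the same way.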
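The rules above can be sketched in Python for a single user. This is only an illustration under an assumption the text does not state explicitly: time_type 1 means start and 0 means end, which is what the lag default of 0 and the lead default of 1 in the query suggest. The function name `first_starts_and_last_ends` is hypothetical.

```python
def first_starts_and_last_ends(events):
    """Return (first_starts, last_ends) for one user.

    events: list of (time_type, ts) sorted by ts.
    Assumption: time_type 1 = start, 0 = end.
    """
    first_starts, last_ends = [], []
    for i, (time_type, ts) in enumerate(events):
        # lag(time_type, 1, 0): previous row, default 0 when there is none
        pre_time_type = events[i - 1][0] if i > 0 else 0
        # lead(time_type, 1, 1): next row, default 1 when there is none
        next_time_type = events[i + 1][0] if i + 1 < len(events) else 1
        if time_type == 1 and pre_time_type == 0:
            first_starts.append(ts)    # first of a run of start events
        if time_type == 0 and next_time_type == 1:
            last_ends.append(ts)       # last of a run of end events
    return first_starts, last_ends


# the four sample rows for user 1
events = [(1, 123), (1, 126), (0, 166), (0, 167)]
print(first_starts_and_last_ends(events))
```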
