Time-Domain Operations in SparkSQL (Window Functions)

Article link: https://blog.csdn.net/MrLevo520/article/details/106807052

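This article looks at several ways to process time-series data with SQL, including how to compute consecutive consumption days, the longest check-in streak, and the peak consumption date. The techniques apply to business scenarios such as games and e-commerce, and are demonstrated with SparkSQL and HiveSQL examples.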

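This post summarizes some SQL operations over the time dimension: consecutive consumption, longest check-in streak, cumulative spend, and so on. Mapped onto other business scenarios these become the same kind of computation; in games, for example, consecutive login days, check-in streak length, and longest check-in streak are all common requirements. The methods are shared, so they are implemented here in SparkSQL. For HiveSQL some of the code may need slight changes, for example a HAVING that filters on a select alias has to be wrapped in an outer query and rewritten as WHERE; those rewrites are not repeated here.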

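Constructing test data

To make the rows easy to split, the fields are joined with @. The first field is the date, the second the user, the third whether the user consumed that day (1 or 0), and the fourth the amount spent.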

20190531@156@1@20
20190601@156@1@20
20190602@156@1@10
20190603@156@0@0
20190604@156@0@0
20190605@156@1@10
20190606@156@1@10
20190607@156@1@10
20190608@156@0@0
20190609@156@1@20
20190610@156@1@20
20190531@187@0@0
20190601@187@1@10
20190602@187@1@20
20190603@187@1@30
20190604@187@1@40
20190605@187@0@0
20190606@187@1@10
20190607@187@0@0
20190608@187@1@20
20190609@187@1@20
20190610@187@1@10
20190609@173@0@0
20190610@173@1@10

Map the data to a table with the following structure

create table tmp_time_exp 
(
    dt string,  
    passenger_phone string,
    is_call string comment 'whether the user consumed (1 = yes, 0 = no)',
    cost bigint comment 'amount spent'
)
row format DELIMITED fields terminated by '@'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/hdfslocation'

Query the table to check that the data loaded as expected

tmp_time_exp.dt	tmp_time_exp.passenger_phone	tmp_time_exp.is_call	tmp_time_exp.cost
20190531	156	1	20
20190601	156	1	20
20190602	156	1	10
20190603	156	0	0
20190604	156	0	0
20190605	156	1	10
20190606	156	1	10
20190607	156	1	10
20190608	156	0	0
20190609	156	1	20
20190610	156	1	20
20190531	187	0	0
20190601	187	1	10
20190602	187	1	20
20190603	187	1	30
20190604	187	1	40
20190605	187	0	0
20190606	187	1	10
20190607	187	0	0
20190608	187	1	20
20190609	187	1	20
20190610	187	1	10
20190609	173	0	0
20190610	173	1	10

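Common problems

1. Find users who consumed on n consecutive days

Example: find the users who consumed on three consecutive days, together with the start and end date of each such run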

select
    passenger_phone,
    is_call,
    cost,
    unix_timestamp(lag(dt,2,0) over(partition by passenger_phone order by dt),'yyyyMMdd') as start_dt,
    dt as end_dt,
    datediff(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(lag(dt,2,0) over(partition by passenger_phone order by dt),'yyyyMMdd'),'yyyy-MM-dd')) as last3day  
from
    tmp_time_exp
where
    is_call != 0 
having  
    last3day = 2

Output

passenger_phone	is_call	cost	start_dt	end_dt	last3day
156	1	10	1559232000	20190602	2
156	1	10	1559664000	20190607	2
187	1	30	1559318400	20190603	2
187	1	40	1559404800	20190604	2
187	1	10	1559923200	20190610	2

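Two notes. First, datediff requires its arguments in the standard yyyy-MM-dd date format, so the yyyyMMdd strings have to be converted first. Second, either lag or lead can be used here: partition by user, order by consumption date, shift the date by two rows, and take the difference. Because rows with is_call = 0 are filtered out first, a difference of exactly 2 between the current date and the date two rows back means the user consumed on all three days in a row. In the query above start_dt is emitted as a Unix timestamp in seconds (1559232000 is 20190531 at midnight UTC+8), while end_dt keeps the yyyyMMdd string.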
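The lag-and-difference check can be cross-checked outside SQL. The following is a minimal Python sketch, not part of the original queries; `paid_days`, `to_date` and `n_day_runs` are hypothetical names introduced for illustration, and the dates are the is_call = 1 rows of the sample data.

```python
from datetime import datetime

# Days with is_call = 1 per user, copied from the sample data (assumed input shape).
paid_days = {
    "156": ["20190531", "20190601", "20190602", "20190605",
            "20190606", "20190607", "20190609", "20190610"],
    "187": ["20190601", "20190602", "20190603", "20190604",
            "20190606", "20190608", "20190609", "20190610"],
    "173": ["20190610"],
}

def to_date(dt):
    """Parse a yyyyMMdd string into a date."""
    return datetime.strptime(dt, "%Y%m%d").date()

def n_day_runs(days, n=3):
    """Return (start, end) pairs where end closes a run of n consecutive days."""
    days = sorted(days)
    runs = []
    for i in range(n - 1, len(days)):
        # start plays the role of lag(dt, n - 1) over the ordered partition
        start, end = days[i - (n - 1)], days[i]
        if (to_date(end) - to_date(start)).days == n - 1:
            runs.append((start, end))
    return runs

three_day_runs = {user: n_day_runs(days) for user, days in paid_days.items()}
for user, runs in three_day_runs.items():
    print(user, runs)
```

Overlapping runs are reported separately, which matches the SQL output where user 187 appears for both 20190603 and 20190604.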

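2. Each user's consecutive-consumption periods, their length, and the total spent in each period

Example: user 156 has the consecutive-consumption periods 5.31-6.2, 6.5-6.7 and 6.9-6.10, with totals of 50, 30 and 40 respectively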

select
    passenger_phone,
    min(dt) as start_day,
    max(dt) as end_day,
    count(1) as last_days,
    sum(cost) as cost_sum
from
(
    select
        *,
        row_number() over(partition by passenger_phone order by dt) as ranker
    from
        tmp_time_exp
    where
        is_call != 0
)a
group by
    passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)

Output

passenger_phone	start_day	end_day	last_days	cost_sum
156	20190531	20190602	3	50
156	20190605	20190607	3	30
156	20190609	20190610	2	40
173	20190610	20190610	1	10
187	20190601	20190604	4	100
187	20190606	20190606	1	10
187	20190608	20190610	3	50

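This approach follows a blog post whose link I can no longer find. It is a neat trick: number each user's consumption dates in date order, then subtract that row number from the date itself and group by the result. Within a run of consecutive dates the date and the row number both grow by one per row, so the difference stays constant; rows that share the same difference therefore belong to the same run, and the number of such rows is the length of the run in days.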
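To make the date-minus-row-number grouping concrete, here is a small Python sketch under the same idea. It is an illustration only: `paid_156`, `to_date` and `consumption_periods` are hypothetical names, and the input is user 156's is_call = 1 rows from the sample data.

```python
from datetime import datetime, timedelta
from itertools import groupby

# (dt, cost) rows for user 156 where is_call = 1, copied from the sample data.
paid_156 = [("20190531", 20), ("20190601", 20), ("20190602", 10),
            ("20190605", 10), ("20190606", 10), ("20190607", 10),
            ("20190609", 20), ("20190610", 20)]

def to_date(dt):
    return datetime.strptime(dt, "%Y%m%d").date()

def consumption_periods(rows):
    """Group consecutive dates by the key (date - row_number)."""
    rows = sorted(rows)
    keyed = [(to_date(dt) - timedelta(days=rank), dt, cost)
             for rank, (dt, cost) in enumerate(rows, start=1)]
    periods = []
    # The key never decreases along sorted dates, so equal keys are adjacent.
    for _, group in groupby(keyed, key=lambda r: r[0]):
        group = list(group)
        periods.append((group[0][1], group[-1][1], len(group),
                        sum(cost for _, _, cost in group)))
    return periods

periods = consumption_periods(paid_156)
for start_day, end_day, last_days, cost_sum in periods:
    print(start_day, end_day, last_days, cost_sum)
```

Each tuple mirrors one output row of the SQL: start_day, end_day, last_days, cost_sum.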

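3. Consecutive consumption days up to and including 6.10, where a gap resets the count (check-in style streak)

Example: user 156 consumed on 6.10; going backwards, 6.9 was also a consumption day but 6.8 was not, so the current streak is 2 days. This is widely used for check-in features: if today's check-in is missed, the running count of check-in days starts over.

Method 1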

select
    *
from
(
    select
        passenger_phone,
        min(dt) as start_time,
        max(dt) as end_time,
        count(1) as day_cnt
    from
    (
        select
            *,
            row_number() over(partition by passenger_phone order by dt) as ranker
        from
            tmp_time_exp
        where
            is_call = 1
    )aa
    group by
        passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)
)bb
where
    end_time = '20190610'

This takes the result of problem 2 and simply restricts the end date to today (6.10).

Method 2

with end_dt as
(
    select
        passenger_phone,
        max(dt) as end_dt
    from
        tmp_time_exp
    where
        dt between '20190531' and '20190610'
        and is_call = 0  -- find each user's latest non-consumption date first
    group by
        passenger_phone
)
select
    aa.dt,
    aa.passenger_phone,
    datediff(from_unixtime(unix_timestamp(aa.dt,'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(bb.end_dt,'yyyyMMdd'),'yyyy-MM-dd')) as day_cnt
from
(
    select
        dt,
        passenger_phone
    from
        tmp_time_exp
    where
        dt = '20190610'  -- users with a row on the target day (6.10)
)aa
join
    end_dt as bb
on
    aa.passenger_phone = bb.passenger_phone

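First get each user's latest non-consumption date: walking backwards from 6.10, the streak stops at the first non-consumption date encountered, so the distance from that date to 6.10 is the length of the unbroken streak. Two caveats: the inner join drops any user who has no is_call = 0 row in the range, and subquery aa does not filter on is_call, so a user who did not consume on 6.10 comes back with day_cnt = 0.

Both methods give the same day_cnt. The output of Method 1 is shown below; Method 2 returns the columns dt, passenger_phone and day_cnt.

passenger_phone	start_time	end_time	day_cnt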
156	20190609	20190610	2
173	20190610	20190610	1
187	20190608	20190610	3
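The backward walk from the target day can be sketched directly in Python. This is a hedged illustration rather than a translation of either query: `series` and `current_streak` are hypothetical names, and each user is encoded as a first date plus one is_call flag per consecutive day, taken from the sample data.

```python
from datetime import datetime, timedelta

# user -> (first date, one is_call flag per consecutive day), from the sample data.
series = {
    "156": ("20190531", "11100111011"),
    "187": ("20190531", "01111010111"),
    "173": ("20190609", "01"),
}

def current_streak(first_day, flags, today):
    """Count consumption days going backwards from today until the first gap."""
    start = datetime.strptime(first_day, "%Y%m%d").date()
    flag_by_day = {start + timedelta(days=i): f for i, f in enumerate(flags)}
    day = datetime.strptime(today, "%Y%m%d").date()
    streak = 0
    # Stops on a 0 flag or on a day with no row at all.
    while flag_by_day.get(day) == "1":
        streak += 1
        day -= timedelta(days=1)
    return streak

streaks = {user: current_streak(first, flags, "20190610")
           for user, (first, flags) in series.items()}
print(streaks)
```

Unlike a join against the latest non-consumption date, this walk also returns a streak for a user who has consumed on every recorded day.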

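4. Longest consecutive-consumption streak

Example: user 156 has the consecutive-consumption periods 5.31-6.2, 6.5-6.7 and 6.9-6.10, with lengths 3, 3 and 2 and totals 50, 30 and 40. This is really a follow-up to problem 2.

Method 1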

select
    passenger_phone,
    start_day,
    end_day,
    last_days,
    rank() over(partition by passenger_phone order by last_days desc) as appose_rank, -- ties share rank 1, so every tied longest streak is kept
    row_number() over(partition by passenger_phone order by last_days desc) as last_ranker  -- no ties: exactly one row per user gets 1
from
(
    select
        passenger_phone,
        min(dt) as start_day,
        max(dt) as end_day,
        count(1) as last_days
    from
    (
        select
            *,
            row_number() over(partition by passenger_phone order by dt) as ranker
        from
            tmp_time_exp
        where
            is_call != 0
    )a
    group by
        passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)
)aa
having
    -- last_ranker = 1
    appose_rank = 1

This reuses the solution to problem 2 and adds one more layer on top of its result to pick out the longest streak per user.

Method 2

select
    cc.*,
    length(dd) as max_length,
    row_number() over(partition by passenger_phone order by length(dd) desc) as ranker
from
(
    select
        passenger_phone,
        concat_ws('',collect_list(is_call)) as call_list  -- collect_list is the Spark/Hive aggregate; there is no collect()
    from
    (
        select
            dt,
            passenger_phone,
            is_call
        from
            tmp_time_exp
        order by
            passenger_phone desc, dt desc
    )aa
    group by
        passenger_phone
)cc
lateral view explode(split(call_list,'0')) asTable as dd
having
    ranker = 1

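This is a somewhat tricky approach that an interviewer once suggested to me: concatenate each user's is_call flags into one string, split it on '0', and the longest remaining run of 1s is the longest streak. It solves the same problem, but it becomes more involved if the dates are needed as well, because part of the date has to be concatenated in up front and unpacked again afterwards. It also relies on the flags being collected in date order, which the order by in the subquery does not strictly guarantee once the rows are grouped.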
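The split-on-zero idea is easy to see in a few lines of Python. This is only a sketch of the string trick, with `flags` and `longest_streak` as hypothetical names and the flag strings written out by hand in date order from the sample data.

```python
# One is_call flag per day, in date order, per user (from the sample data).
flags = {
    "156": "11100111011",
    "187": "01111010111",
    "173": "01",
}

# Splitting on "0" leaves the runs of 1s; the longest piece is the longest streak.
longest_streak = {user: max(len(run) for run in s.split("0"))
                  for user, s in flags.items()}
print(longest_streak)
```

Empty pieces produced by leading or adjacent zeros have length 0, so they never win the max.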

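Both methods agree on the longest streak per user. The output of Method 1 with appose_rank = 1 is shown below.

passenger_phone	start_day	end_day	last_days	appose_rank	last_ranker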
156	20190531	20190602	3	1	1
156	20190605	20190607	3	1	2
173	20190610	20190610	1	1	1
187	20190601	20190604	4	1	1

5. Peak consumption date

Example: the date on which the number of consuming users was highest

Method 1

select
    dt,
    passenger_phone,
    is_call_cnt,
    rank() over(order by is_call_cnt desc) as call_ord_ranker
from
(
    select
        *,
        sum(is_call) over(partition by dt) as is_call_cnt
    from
        tmp_time_exp
)aa
having
    call_ord_ranker = 1

Method 2

select
    *,
    first_value(dt) over(order by is_call_cnt desc) as max_dt
from
(
    select
        *,
        sum(is_call) over(partition by dt) as is_call_cnt
    from
        tmp_time_exp
)aa
having
    max_dt = dt

Result (output of Method 2; is_call is a string column, so its sum comes back as a double)

dt	passenger_phone	is_call	cost	is_call_cnt	max_dt
20190610	187	1	10	3.0	20190610
20190610	173	1	10	3.0	20190610
20190610	156	1	20	3.0	20190610
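As a cross-check of the peak date, the per-day count of consuming users can be computed in Python. This is a sketch with hypothetical names (`series`, `payers_per_day`, `peak_days`); the data is the sample table encoded as a first date plus one is_call flag per consecutive day.

```python
from collections import Counter
from datetime import datetime, timedelta

# user -> (first date, one is_call flag per consecutive day), from the sample data.
series = {
    "156": ("20190531", "11100111011"),
    "187": ("20190531", "01111010111"),
    "173": ("20190609", "01"),
}

# Equivalent of sum(is_call) over(partition by dt).
payers_per_day = Counter()
for first_day, day_flags in series.values():
    start = datetime.strptime(first_day, "%Y%m%d").date()
    for offset, flag in enumerate(day_flags):
        day = (start + timedelta(days=offset)).strftime("%Y%m%d")
        payers_per_day[day] += int(flag)

top = max(payers_per_day.values())
# Keep every day that ties for the top count, like rank() = 1 does.
peak_days = sorted(d for d, c in payers_per_day.items() if c == top)
print(top, peak_days)
```

Returning a list keeps tied dates visible, which matters because picking a single first value among ties is not deterministic.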

6. Date on which cumulative spend reaches x

Example: for user 156, cumulative spend first reaches 50 on 6.2 and first reaches 100 on 6.9

select
    passenger_phone,
    min(if(cost_until_today >= 50, dt, null)) as min_gt50_dt,    -- first date the running total reaches 50, NULL if it never does
    min(if(cost_until_today >= 100, dt, null)) as min_gt100_dt   -- first date the running total reaches 100, NULL if it never does
from
(
    select
        dt,
        passenger_phone,
        cost,
        sum(cost) over(partition by passenger_phone order by dt) as cost_until_today
    from
        tmp_time_exp
)aa
group by 
    passenger_phone

Result

passenger_phone	min_gt50_dt	min_gt100_dt
156	20190602	20190609
173	NULL	NULL
187	20190603	20190604

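The core of this query is sum() over(partition by ... order by dt), which gives the sum within the partition up to and including dt, in other words a running total. For finer control over the frame boundaries, see problem 7 below.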
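The running-total idea can be reproduced with itertools.accumulate. The sketch below is an illustration with hypothetical names (`costs`, `first_day_reaching`, `reached`); it returns None for a user whose total never reaches the threshold, which in the sample data is the case for user 173 (total spend 10).

```python
from datetime import datetime, timedelta
from itertools import accumulate

# user -> (first date, cost per consecutive day), copied from the sample data.
costs = {
    "156": ("20190531", [20, 20, 10, 0, 0, 10, 10, 10, 0, 20, 20]),
    "187": ("20190531", [0, 10, 20, 30, 40, 0, 10, 0, 20, 20, 10]),
    "173": ("20190609", [0, 10]),
}

def first_day_reaching(first_day, daily_costs, threshold):
    """Return the first yyyyMMdd on which the running total is >= threshold."""
    start = datetime.strptime(first_day, "%Y%m%d").date()
    # accumulate() yields the same values as sum(cost) over(... order by dt)
    for offset, running_total in enumerate(accumulate(daily_costs)):
        if running_total >= threshold:
            return (start + timedelta(days=offset)).strftime("%Y%m%d")
    return None  # threshold never reached

reached = {user: (first_day_reaching(first, c, 50), first_day_reaching(first, c, 100))
           for user, (first, c) in costs.items()}
print(reached)
```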

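7. Maximum spend within a time interval

Example: suppose the requirement is to find, within the three days before and after 6.5, the day with the highest spend. Interval maxima like this are almost always computed with a window function, along the lines of max() over(partition by ... order by dt rows between 3 preceding and 3 following). For the row at dt this looks back three rows and ahead three rows, seven rows in total including the current one, and takes the maximum over that frame. The frame counts rows, not calendar days, so it equals a seven-day window only when each user has exactly one row per day, as in the sample data. Replacing the aggregate with sum gives the total over the same frame.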

select
    dt,
    passenger_phone,
    cost,
    max(cost) over(partition by passenger_phone order by dt rows between unbounded preceding and current row) as until_cur_max,
    max(cost) over(partition by passenger_phone order by dt) as until_cur_max2,  -- same result as the line above here, since dt is unique per user
    max(cost) over(partition by passenger_phone order by dt rows between 3 preceding and 3 following) as before3later3_max,
    sum(cost) over(partition by passenger_phone order by dt rows between 3 preceding and 3 following) as before3later3_sum
from
    tmp_time_exp

Result

dt	passenger_phone	cost	until_cur_max	until_cur_max2	before3later3_max	before3later3_sum
20190531	156	20	20	20	20	50
20190601	156	20	20	20	20	50
20190602	156	10	20	20	20	60
20190603	156	0	20	20	20	70
20190604	156	0	20	20	20	60
20190605	156	10	20	20	10	40
20190606	156	10	20	20	20	50
20190607	156	10	20	20	20	70
20190608	156	0	20	20	20	70
20190609	156	20	20	20	20	60
20190610	156	20	20	20	20	50
20190609	173	0	0	0	10	10
20190610	173	10	10	10	10	10
20190531	187	0	0	0	30	60
20190601	187	10	10	10	40	100
20190602	187	20	20	20	40	100
20190603	187	30	30	30	40	110
20190604	187	40	40	40	40	110
20190605	187	0	40	40	40	120
20190606	187	10	40	40	40	120
20190607	187	0	40	40	40	100
20190608	187	20	40	40	20	60
20190609	187	20	40	40	20	60
20190610	187	10	40	40	20	50
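The clipped seven-row frame is easy to verify by hand with list slicing. The following Python sketch uses hypothetical names (`costs_156`, `frame`, `window_max`, `window_sum`) and user 156's daily costs from the sample data; it reproduces the before3later3_max and before3later3_sum columns for that user.

```python
# Daily cost for user 156 from 20190531 to 20190610, one value per day.
costs_156 = [20, 20, 10, 0, 0, 10, 10, 10, 0, 20, 20]

def frame(values, i, before=3, after=3):
    """Rows between `before` preceding and `after` following, clipped at the edges."""
    return values[max(0, i - before): i + after + 1]

window_max = [max(frame(costs_156, i)) for i in range(len(costs_156))]
window_sum = [sum(frame(costs_156, i)) for i in range(len(costs_156))]

print(window_max)
print(window_sum)
```

At the edges of a partition the frame simply has fewer rows, which is why the first and last rows sum over only four values.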