连续问题 : rank + date_diff
间隔连续问题: 计算前一个数据量, 根据当前数据跟前一行数据的diff, 计算是否属于同一个组(是否连续flag), 累加flag得到flag_sum
根据uid, flag_sum进行分组,得到用户间隔连续的登陆次数
- 累加问题: 编写sql实现每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数
样本数据:
# 样本数据
userid,month,visits
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,1
# 创建hive表
spark.read.option("header", true).csv("/user/vc/users/vc/test/demo_4.csv").write.saveAsTable("temp.test_demo_4")
# 使用窗口函数over
select userid, month,visit_cnt_m,
sum(visit_cnt_m) over(partition by userid order by month) as visit_cnt_total,
max(visit_cnt_m) over(partition by userid order by month) as visit_cnt_m_max
from
(select userid, month, sum(visits) as visit_cnt_m from temp.test_demo_4 group by userid, month ) as t1
- 连续登陆问题:编写连续7天登录的总人数
样本数据:
uid dt login_status(1登录成功,0异常)
1 2019-07-11 1
1 2019-07-12 1
1 2019-07-13 1
1 2019-07-14 1
1 2019-07-15 1
1 2019-07-16 1
1 2019-07-17 1
1 2019-07-18 1
2 2019-07-11 1
2 2019-07-12 1
2 2019-07-13 0
2 2019-07-14 1
2 2019-07-15 1
2 2019-07-16 0
2 2019-07-17 1
2 2019-07-18 0
3 2019-07-11 1
3 2019-07-12 1
3 2019-07-13 1
3 2019-07-14 1
3 2019-07-15 1
3 2019-07-16 1
3 2019-07-17 1
3 2019-07-18 1
代码:
select uid, dt_diff, count(1) as cnt from
(select uid, dt, rnk, date_sub(dt, rnk) as dt_diff from
(select uid, dt, row_number() over(partition by uid order by dt) as rnk
from temp.test_demo_5 where login_status ='1'
group by uid, dt ) as t1 ) as t2
group by uid, dt_diff
having cnt>=7
- 间隔连续问题
当同一个用户id的本次开始时间跟上一次浏览的结束时