Hive统计最近七天内连续登陆3天的用户数量

需求:统计最近七天内连续登陆3天的用户数量

欢迎关注今日头条号、微信公众号、知乎号:仰望夜空一万次

基础数据:

uid为1的用户在‘2020-09-18’号没有登陆。

val df = Seq(
  ("2020-09-21",1),
  ("2020-09-20",1),
  ("2020-09-19",1),
  ("2020-09-17",1),
  ("2020-09-16",1),
  ("2020-09-15",1),
  ("2020-09-20",2),
  ("2020-09-19",2),
  ("2020-09-20",3),
  ("2020-09-19",3),
  ("2020-09-18",3)
).toDF("dt", "uid")
df.createTempView("tmp")

第一步,使用rank over函数,根据用户分组,并且按时间排序

val sql2=
  s"""
     | select  uid, dt,rank() over(partition by uid order by dt) rank
     | from tmp
     | where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'
     |
   """.stripMargin
println(sql2)
ss.sql(sql2).show(false)

结果:
+---+----------+----+
|uid|dt        |rank|
+---+----------+----+
|1  |2020-09-15|1   |
|1  |2020-09-16|2   |
|1  |2020-09-17|3   |
|1  |2020-09-19|4   |
|1  |2020-09-20|5   |
|1  |2020-09-21|6   |
|3  |2020-09-18|1   |
|3  |2020-09-19|2   |
|3  |2020-09-20|3   |
|2  |2020-09-19|1   |
|2  |2020-09-20|2   |
+---+----------+----+

第二步,使用date_sub函数,获得date_dif的值,此值相同表示发生连续登陆行为。

s"""
   | select uid,date_sub(dt,rank) date_dif
   |  from
   |      (select  uid, dt,rank() over(partition by uid order by dt) rank
   |        from tmp
   |        where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'
   |      ) t1
 """.stripMargin
 
结果: 
 +---+----------+
|uid|date_dif  |
+---+----------+
|1  |2020-09-14|
|1  |2020-09-14|
|1  |2020-09-14|
|1  |2020-09-15|
|1  |2020-09-15|
|1  |2020-09-15|
|3  |2020-09-17|
|3  |2020-09-17|
|3  |2020-09-17|
|2  |2020-09-18|
|2  |2020-09-18|
+---+----------+
发现uid为1的用户,在最近七天内有两段连续3天登陆的行为。
第三步,获取连续3天登陆的uid。

val sql2=
  s"""
     | select uid
     |      from
     |          (
     |            select uid,date_sub(dt,rank) date_dif
     |            from
     |                 (select  uid, dt,rank() over(partition by uid order by dt) rank
     |                   from tmp
     |                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'
     |                 ) t1
     |          ) t2
     |     group by uid,date_dif
     |     having count(*)>=3
   """.stripMargin


结果: 
|uid|
+---+
|1  |
|1  |
|3  |
+---

第四步,对uid进行去重,此处使用group by,而非count(distinct uid),为并行分组uid提供保障。

in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE.

https://stackoverflow.com/questions/19311193/why-is-countdistinct-slower-than-group-by-in-hive

需要查看执行计划,看看版本是否支持。

|   select uid
|   from
|   (
|      select uid
|      from
|          (
|            select uid,date_sub(dt,rank) date_dif
|            from
|                 (select  uid, dt,rank() over(partition by uid order by dt) rank
|                   from tmp
|                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'
|                 ) t1
|          ) t2
|     group by uid,date_dif
|     having count(*)>=3
|   ) t3
|   group by uid

+---+
|uid|
+---+
|1  |
|3  |
+---+

 

第五步,获取最终数据结果。

compute_date表示统计日期,compute_period表示统计周期,continue_3_days_number表示统计周期内连续登陆3天的总用户数量。

val sql2=
  s"""
     |select
     |'2020-09-21' as compute_date,
     |concat(date_add('2020-09-21',-6),'_','2020-09-21') as compute_period,
     |count(*) as continue_3_days_number
     |from
     |(
     |   select uid
     |   from
     |   (
     |      select uid
     |      from
     |          (
     |            select uid,date_sub(dt,rank) date_dif
     |            from
     |                 (select  uid, dt,rank() over(partition by uid order by dt) rank
     |                   from tmp
     |                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'
     |                 ) t1
     |          ) t2
     |     group by uid,date_dif
     |     having count(*)>=3
     |   ) t3
     |   group by uid
     |) t4
   """.stripMargin


+------------+---------------------+----------------------+
|compute_date|compute_period       |continue_3_days_number|
+------------+---------------------+----------------------+
|2020-09-21  |2020-09-15_2020-09-21|2                     |
+------------+---------------------+----------------------+

 

第五步的另外一种写法,使用count(distinct)的最终写法

val sql2=
  s"""
     |select
     |'2020-09-21' as compute_date,
     |concat(date_add('2020-09-21',-6),'_','2020-09-21') as compute_period,
     |count(distinct uid) as continue_3_days_number
     |   from
     |   (
     |      select uid
     |      from
     |          (
     |            select uid,date_sub(dt,rank) date_dif
     |            from
     |                 (select  uid, dt,rank() over(partition by uid order by dt) rank
     |                   from tmp
     |                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'
     |                 ) t1
     |          ) t2
     |     group by uid,date_dif
     |     having count(*)>=3
     |   ) t3
   """.stripMargin
println(sql2)
ss.sql(sql2).show(false)

 

  • 11
    点赞
  • 61
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值