- This problem generalizes to many similar ones: consecutive months of membership top-ups, consecutive days with sales, consecutive ride-hailing trips, consecutive overdue payments.
- Test data: user ID, login date
uid,dt
guid01,2018-02-28
guid01,2018-03-01
guid01,2018-03-02
guid01,2018-03-04
guid01,2018-03-05
guid01,2018-03-06
guid01,2018-03-07
guid02,2018-03-01
guid02,2018-03-02
guid02,2018-03-03
guid02,2018-03-06
--Expected result (note that guid01 has two qualifying streaks in this data: 2018-02-28 through 2018-03-02 is also 3 consecutive days, since 2018 is not a leap year)
+---------+--------+-------------+-------------+
|   uid   | times  | start_date  |  end_date   |
+---------+--------+-------------+-------------+
| guid01  | 3      | 2018-02-28  | 2018-03-02  |
| guid01  | 4      | 2018-03-04  | 2018-03-07  |
| guid02  | 3      | 2018-03-01  | 2018-03-03  |
+---------+--------+-------------+-------------+
--Implementation approach
A login streak is consecutive when adjacent dates differ by exactly 1 day.
Steps (a worked example of the intermediate values follows this list):
1. Partition the rows by user, order by date ascending, and number them with row_number() as rn.
2. Compute dis = date_sub(dt, rn); dis stays constant across one run of consecutive dates.
3. Group by user and dis, aliasing the row count of each group as times.
4. Keep the groups where times >= 3 (i.e. times > 2).
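To make step 2 concrete, here are the intermediate values for the test data. Within one streak, dt and rn grow in lockstep, so dis stays constant; whenever a day is skipped, dis jumps to a new value:
+---------+-------------+-----+-------------+
|   uid   |     dt      | rn  |     dis     |
+---------+-------------+-----+-------------+
| guid01  | 2018-02-28  | 1   | 2018-02-27  |
| guid01  | 2018-03-01  | 2   | 2018-02-27  |
| guid01  | 2018-03-02  | 3   | 2018-02-27  |
| guid01  | 2018-03-04  | 4   | 2018-02-28  |
| guid01  | 2018-03-05  | 5   | 2018-02-28  |
| guid01  | 2018-03-06  | 6   | 2018-02-28  |
| guid01  | 2018-03-07  | 7   | 2018-02-28  |
| guid02  | 2018-03-01  | 1   | 2018-02-28  |
| guid02  | 2018-03-02  | 2   | 2018-02-28  |
| guid02  | 2018-03-03  | 3   | 2018-02-28  |
| guid02  | 2018-03-06  | 4   | 2018-03-02  |
+---------+-------------+-----+-------------+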
--Implementation code
import cn.doit.sparksql.day01.utils.SparkUtils
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.{DataFrame, SparkSession}
object LoginUser {
def main(args: Array[String]): Unit = {
    /**
     * Requirement: find users with 3 or more consecutive login days.
     * Test data: user ID, login date, e.g.
     *   guid01,2018-02-28
     */
val spark: SparkSession = SparkUtils.getSparkSession()
import spark.implicits._
val df: DataFrame = spark.read
.options(Map("header" -> "true", "inferSchema" -> "true"))
.csv("doc/loginUser.csv")
// df.printSchema()
// df.show()
import org.apache.spark.sql.functions._
    /**
     * DSL style
     */
val win: WindowSpec = Window.partitionBy('uid).orderBy('dt)
    df.select('uid, 'dt, row_number() over win as "rn")
      // functions.date_sub(start: Column, days: Int) only accepts a literal Int for days,
      // so the SQL form of date_sub is used through expr() to pass the rn column
      .select('uid, 'dt, expr("date_sub(dt, rn)") as "dis")
      .groupBy('uid, 'dis)
      .agg(min('dt) as "start_date", max('dt) as "end_date", count('uid) as "times")
      .where("times > 2")
      .select('uid, 'times, 'start_date, 'end_date)
      .show()
    /**
     * Spark SQL style
     */
df.createTempView("login")
    spark.sql(
      """
        |select
        |    uid, count(1) as times, min(dt) as start_date, max(dt) as end_date
        |from
        |(
        |    select
        |        uid, dt, date_sub(dt, rn) as dis
        |    from
        |    (
        |        select
        |            uid, dt, row_number() over(partition by uid order by dt) as rn
        |        from login
        |    ) t1
        |) t2
        |group by uid, dis
        |having times > 2
        |""".stripMargin).show()
}
}
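An alternative that shows up often for this class of problems replaces the date_sub trick with an explicit gap flag: lag() exposes the previous login date, datediff() marks where a streak breaks, and a running sum of the break flags acts as a streak ID. Below is a minimal sketch, assuming the same spark session, the same df (columns uid, dt), and the spark.implicits._ import from the listing above; the column names brk and sid are my own:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byUser = Window.partitionBy('uid).orderBy('dt)

df
  // brk = 1 when the previous login is not exactly one day earlier (or absent), else 0
  .withColumn("brk", when(datediff('dt, lag('dt, 1) over byUser) === 1, 0).otherwise(1))
  // running sum of breaks: constant within one streak, so it serves as a streak ID
  .withColumn("sid", sum('brk) over byUser.rowsBetween(Window.unboundedPreceding, Window.currentRow))
  .groupBy('uid, 'sid)
  .agg(min('dt) as "start_date", max('dt) as "end_date", count('uid) as "times")
  .where("times > 2")
  .select('uid, 'times, 'start_date, 'end_date)
  .show()

Both the date_sub and the lag variants cost one window pass plus one aggregation; the lag form is handier when "consecutive" should tolerate small gaps, because the datediff threshold can be loosened without touching the grouping logic.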