Sample data (data/liuliang.txt):
uid,start_time,end_time,flow
1,2020-02-18 14:20:30,2020-02-18 14:46:30,20
1,2020-02-18 14:47:20,2020-02-18 15:20:30,30
1,2020-02-18 15:37:23,2020-02-18 16:05:26,40
1,2020-02-18 16:06:27,2020-02-18 17:20:49,50
1,2020-02-18 17:21:50,2020-02-18 18:03:27,60
2,2020-02-18 14:18:24,2020-02-18 15:01:40,20
2,2020-02-18 15:20:49,2020-02-18 15:30:24,30
2,2020-02-18 16:01:23,2020-02-18 16:40:32,40
2,2020-02-18 16:44:56,2020-02-18 17:40:52,50
3,2020-02-18 14:39:58,2020-02-18 15:35:53,20
3,2020-02-18 15:36:39,2020-02-18 15:24:54,30
Spark SQL implementation:
import org.apache.spark.sql.{DataFrame, SparkSession}

object GuidDemoSql {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("GuidDemoSql")
      .master("local[*]")
      .getOrCreate()

    // Read the CSV data; the first line is the header (uid,start_time,end_time,flow).
    val df: DataFrame = spark
      .read
      .option("header", "true")
      .csv("data/liuliang.txt")

    df.createTempView("v_flow")

    spark.sql(
      """
        |select
        |  uid,
        |  min(start_time) as start_time,
        |  max(end_time)   as end_time,
        |  sum(flow)       as sum_flow
        |from (
        |  -- running sum of the flags assigns a session id per uid
        |  select
        |    uid,
        |    start_time,
        |    end_time,
        |    flow,
        |    sum(flag) over(partition by uid order by start_time) as sum_flag
        |  from (
        |    -- flag = 1 when the gap to the previous record exceeds 600 seconds (10 minutes)
        |    select
        |      uid,
        |      start_time,
        |      end_time,
        |      flow,
        |      if(to_unix_timestamp(start_time, "yyyy-MM-dd HH:mm:ss")
        |           - to_unix_timestamp(lag_time, "yyyy-MM-dd HH:mm:ss") > 600, 1, 0) as flag
        |    from (
        |      -- previous record's end_time; the first record defaults to its own start_time
        |      select
        |        uid,
        |        start_time,
        |        end_time,
        |        flow,
        |        lag(end_time, 1, start_time) over(partition by uid order by start_time asc) as lag_time
        |      from v_flow
        |    ) t1
        |  ) t2
        |) t3
        |group by uid, sum_flag
        |""".stripMargin
    ).show()

    spark.stop()
  }
}
This post demonstrated a Spark SQL approach to aggregating user traffic data. By applying a window function and a conditional flag to the raw records, each user's consecutive sessions can be identified (a record starting more than 10 minutes after the previous one begins a new session), and the minimum start time, maximum end time, and total flow of each session can be computed. The approach suits scenarios that require real-time or batch processing of large volumes of online behavior data.
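For readers who prefer the DataFrame API, the same sessionization can be expressed with Window functions. The sketch below is a minimal, untested equivalent of the SQL above, assuming the same input file and column names; the object name GuidDemoDsl is made up for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical DataFrame-API equivalent of GuidDemoSql; the 600-second gap mirrors the SQL version.
object GuidDemoDsl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GuidDemoDsl")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read.option("header", "true").csv("data/liuliang.txt")

    val byUid = Window.partitionBy("uid").orderBy("start_time")

    val sessions = df
      // previous record's end_time; the first record falls back to its own start_time
      .withColumn("lag_time", coalesce(lag(col("end_time"), 1).over(byUid), col("start_time")))
      // flag = 1 when the gap to the previous record exceeds 600 seconds
      .withColumn("flag",
        when(unix_timestamp(col("start_time"), "yyyy-MM-dd HH:mm:ss")
          - unix_timestamp(col("lag_time"), "yyyy-MM-dd HH:mm:ss") > 600, 1).otherwise(0))
      // running sum of flags becomes the session id within each uid
      .withColumn("sum_flag", sum("flag").over(byUid))
      .groupBy("uid", "sum_flag")
      .agg(
        min("start_time").as("start_time"),
        max("end_time").as("end_time"),
        sum("flow").as("sum_flow"))

    sessions.show()
    spark.stop()
  }
}

With the sample data above, this grouping should collapse uid 1's five records into two sessions (the record starting at 15:37:23 begins more than 10 minutes after the previous end time), uid 2's four records into three sessions, and uid 3's two records into one.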